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ABSTRACT 


Reverse proxy servers are valuable assets to defend outside hosts from seeing the internal net¬ 
work strueture upon whieh the reverse proxy is serving. They are frequently used to proteet 
valuable files, systems, and internal users from external users while still providing serviees to 
outside hosts. Another aspeet of reverse proxies is that they ean be installed remotely by mali- 
eious aetors onto eompromised maehines in order to serviee malieious eontent while masking 
where the eontent is truly hosted. Reverse proxies internet over the HyperText Transfer Protoeol 
(HTTP), whieh is delivered via the Transmission Control Protoeol (TCP). TCP flows provide 
various details regarding eonneetions between an end host and a server. One sueh detail is the 
timestamp of eaeh paeket delivery. Coneurrent timestamps may be used to ealeulate round trip 
times with some serutiny. Previous work in timing analysis suggests that aetive HTTP probes 
to servers ean be analyzed at the originating host in order to elassify servers as reverse proxies 
or otherwise. We eolleet TCP session data from a variety of global vantage points, aetively 
probing a list of servers with a goal of developing an effeetive elassifier to diseern whether eaeh 
server is a reverse proxy or not based on the timing of paeket round trip times. 
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Executive Summary 


A reverse proxy server handles incoming network requests by maintaining two connections; the 
connection from the requesting host external to the network, and a second connection with the 
data storage host internal to the network. The requesting host only sees the traffic between itself 
and the reverse proxy server, and the internally networked, data storage machine only sees the 
traffic between itself and the reverse proxy server. This hidden traffic concept supports network 
access control, security protection through obfuscation, and performance boosts at the Internet 
facing access points for handling thousands of incoming requests. 

Reverse proxies are used for a variety of reasons. One such way they are used is for security 
and information hiding. Nefariously, botnet controllers use numerous ways to mask their origin 
and implementation in order to maintain operations, whether the task is spam, denial-of-service 
attacks, or otherwise. Additionally, home users and large businesses alike implement security 
measures to protect the data that they serve. Reverse proxies hide the network assets on the non- 
Internet facing routes. This is helpful in two major ways; providing a single source of traffic 
control for monitoring network activity and hiding the internal network infrastructure from 
the requesting host(s). Another way reverse proxies are used is for performance boosts. As 
a performance boost, reverse proxies are employed by Content Distribution Networks (CDNs) 
such as Akamai or web-hosting services to support fast delivery of web-based content for online 
companies by routing host requests to their reverse proxy servers. 

Webservers such as Apache and Nginx provide high performance, reverse proxy solutions. 
It is easy to see that highly trafficked Internet sites can increase their client throughput by 
utilizing reverse proxy servers that increase their potential for sales or advertising. From a 
more malicious viewpoint, botnets pose a serious threat to network bandwidth, home users, 
companies, corporations, and governments. Millions of dollars are spent annually to deter 
botnet operations such as spam and denial-of-service attacks from harming Internet users and 
service providers. The ability to identify machines that are a part of a botnet or botnets can help 
deter or neutralize botnet attacks, either by blocking network connectivity between the infected 
host and the Webserver or by taking the infected host offline through legal means. As stated 
previously, reverse proxy servers are also helpful for maintaining network infrastructure from a 
managerial perspective. Researchers studying network edges may find it helpful to know that a 
single destination address is actually a reverse proxy, handling HTTP requests for any number of 
content providers. Probing techniques and research results could be tailored to handle reverse 
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proxies once they are identified as such. If my advisors and I are able to correlate timing 
analysis with reverse proxy servers to a high enough level of confidence, we could distribute 
the method to security groups or research organizations. The concept could be used to perform 
self-observational security penetration tests, analyzing CDNs and DNS lookup correlation(s), 
or to map out networks for military use, similar to Intelligence Preparation of the Battlespace 
(IPB) applied to network warfare. 
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CHAPTER 1: 
INTRODUCTION 


As the number of malicious actions increases daily on the Internet, so must defenses for com¬ 
panies, governments, and even home users. Since the early 2000s, there has been a significant 
expansion in the number of network security services to combat malicious network operations. 
Those who conduct malicious acts on the Internet have learned that they must defend their own 
platforms as well. Reverse proxies can be used by malicious actors as well as trusted agents. 
Trusted agents and security professionals would prefer to deter malicious activities before they 
occur. Network researchers may also wish to map the topology of a network. The ability to 
identify system characteristics and network setups could contribute to identifying malicious ori¬ 
gins, leading to a better network understanding for security professionals and researchers alike. 
Therefore it is important, from both the defense and offense perspectives, to characterize the 
behaviors of server types. More specifically, reverse proxies provide several functionalities that 
can be used effectively by trusted agents and malicious actors similarly. Reverse proxy servers 
are assets to defend outside hosts from seeing the internal network structure upon which the 
reverse proxy is serving. Using timing analysis of TCP and HTTP flows, which is the protocol 
upon which reverse proxies operate, there is a wealth of timing information that can translate 
into usable intelligence for detecting the use of reverse proxies by a network domain. 

1.1 Problem Statement 

The ability to check data authenticity is very important given that malicious network attacks 
are occurring more often than once a second on the Internet [I]. How do Internet users recog¬ 
nize when they are connecting to an authentic content provider and not some malicious back¬ 
end server? Unsuspecting users could be vulnerable to man-in-the-middle (MITM) attacks or 
web-based attacks [2] against web browsers and browser plug-ins. There is no straightforward 
answer to this question because of the numerous technicalities introduced by network proto¬ 
cols and structure. Domain names are a vulnerability, for example, because individually they 
provide no stamp or insight into the trustworthiness of the server to which they direct users, 
but they are a pivotal functionality of web browsing. In order to increase performance amongst 
other reasons, content and other service providers use proxies to control the flow of network 
traffic and return content to customers. A widely used type of proxy server used by webservers 
and Content Distribution Networks (CDNs) is the reverse proxy. 
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A reverse proxy server handles incoming network requests by maintaining two connections: 
the connection from the requesting host external to the network, and a second connection with 
the data storage host internal to the network. The requesting host only sees the traffic between 
itself and the reverse proxy server, and the internally networked, data storage machine only sees 
the traffic between itself and the reverse proxy server (See Figure 2.3). This hidden traffic con¬ 
cept supports network access control, security protection through obfuscation, and performance 
boosts at the Internet facing access points for handling thousands of incoming requests. 

Reverse proxies are used primarily for security and information hiding. Nefariously, botnet 
controllers use numerous ways to mask their origin and implementation in order to maintain 
operations, whether the task is spam, denial-of-service attacks, or otherwise [3], [4], [5]. Addi¬ 
tionally, home users and large businesses alike implement security measures to protect the data 
that they serve. Reverse proxies hide the network assets on the non-Internet facing routes. This 
information hiding is helpful in two major ways; providing a single source of traffic control for 
monitoring network activity and hiding the internal network infrastructure from the requesting 
host(s). 

Another way reverse proxies are used is for performance gains. Reverse proxies are em¬ 
ployed by CDNs such as Akamai to support fast delivery of web-based content for online com¬ 
panies by routing host requests to their reverse proxy servers. Reverse proxies are able to cache 
HTTP requests that help increase throughput and lower latency for web clients [6], [7]. Addi¬ 
tionally, Internet and web service providers can use reverse proxies to handle incoming requests 
across web hosting farms [7] handling multitudes of web sites and their content. Webservers 
such as Apache and Nginx provide high performance, reverse proxy solutions to service thou¬ 
sands of requests per second. It is not difficult to see that highly trafficked Internet sites can 
increase their client throughput by utilizing reverse proxy servers which increases their poten¬ 
tial for sales or advertising. 

The ability to reliably identify reverse proxies is valuable to better understand a network 
topology as well as identify possible vector origins for malicious actors. Botnets pose a serious 
threat to network bandwidth, home users, companies, corporations, and governments. Millions 
of dollars are spent annually to deter botnet operations such as spam and denial-of-service 
attacks from harming Internet users and service providers [8], [9] as well as cleaning up the mess 
that botnets leave behind. The ability to identify machines that are a part of a botnet or botnets 
can help deter or neutralize botnet attacks, either by blocking network connectivity between 
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the infected host and the Webserver or by taking the infected host offline through legal means. 
Addtionally, researchers studying network topologies may find it helpful to know that a single 
destination address is actually a reverse proxy, handling numerous requests for any number of 
content providers. Probing techniques and research results could be tailored to handle reverse 
proxies once they are identified as such. 

There are some methods that can be used to detect reverse proxy servers. Each technique 
currently known requires at least one controlled variable in addition to the standard network 
traffic generated. For example, there exists a system [10] that can detect reverse proxies but 
it requires the use of the Extensible Markup Eanguage (XME). This is difficult in practice 
because not all websites employ applets and XME. We desire to fingerprint all reverse proxies 
with minimal limitations. No publicly known techniques have shown a process to detect or 
classify reverse proxies using only the TCP traffic from a remote host, specifically the TCP 
Round Trip Times (RTTs). 


1.2 Research Description and Hypothesis 

Our hypothesis is that we can identify, or fingerprint, reverse proxy servers by analyzing 
the TCP flows between remote hosts under our control and end hosts by comparing the RTTs 
for the three-way handshake to the RTTs for the HTTP traffic. Reverse proxy servers should 
show an increased RTT for some of the HTTP packets since the reverse proxy server has to 
forward the external requests to an internal machine. This may not be the case with caching, 
since reverse proxies are also used to reduce load on internal network webservers. However, we 
hypothesize that the malicious hosts running webservers are unlikely to implement caching with 
their exploited hosts and they often need to fetch some dynamic content for targeted attacks, 
even with caching turned on. We pose the following primary research questions: 

1. Is the timing analysis of TCP flow RTTs sufficient to classify webservers as reverse prox¬ 
ies with reasonable accuracy? 

2. Will RTTs from the initial handshake RTT to establish a TCP connection be equivalent 
for data RTTs for a direct connection? 

3. Can RTT ratios of the GET RTT over the 3WHS RTT for multiple TCP connections to a 
single server be used as a threshold for classifying those servers as direct connections or 
reverse proxies? 

4. How reliable are existing TCP flow tools when identifying and calculating RTTs? 
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1.3 Organization 

Chapter 1 is an introduction to our thesis research. It presents high-level reasoning for pursu¬ 
ing the identification of reverse proxy servers. Additionally, Chapter 1 covers the organization 
and structure for the thesis paper in order to frame the problem and solution method. 

Chapter 2 provides the baseline for the research. There are several sections describing the 
history of key concepts for the study and contributing elements to our research. This chapter 
motivates the purpose and focus of the research topic, and sets up the reader to fully understand 
the methodology and approach in follow on chapters. 

Chapter 3 uses the concepts and fundamental knowledge provided in Chapter 2 as a starting 
point for creating our fingerprinting strategy. We look at several test cases to analyze how we 
can collect network traffic for detailed RTT analysis. Lastly, we develop a methodology for 
collecting flows and their corresponding RTTs. 

The results of our data collection and our analysis are presented in Chapter 4. We look at data 
organization and website groups first. Secondly, we analyze the results of each website group 
with comparison between the groups. The reverse proxy results are provided as a transition into 
fingerprinting characteristics that we use to perform a selection of case studies for individual 
websites. 

Chapter 5 includes the conclusions of the research and future work. We describe the strengths 
and weaknesses, contributions, and future work of our research. 
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CHAPTER 2: 

BACKGROUND AND RELATED WORK 


The Internet has evolved from a simple communication sharing scheme between universi¬ 
ties to a global network of systems. The network communication path established between a 
client and a server on the Internet is directly related to the network architecture which is never 
controlled by a single entity. The architecture is a shared responsibility amongst network ad¬ 
ministrators, autonomous system owners, organizations, communications companies, and so 
forth. One of the methods by which network hosts may gain some control over their network 
path to a server is through the use of proxies. Proxies also provide a level of anonymity for 
clients and servers by serving as an intermediate node in a network path on behalf of the re¬ 
questing host. In regards to the Internet, web content delivered via proxies can be monitored, 
hidden, or rerouted to provide security for either the requesting host or the servicing host. In 
some cases, the service that proxies provide is used for malicious reasons. We want to identify 
the systems that are being used for malicious proxy use by performing active measurements, 
conducting network traffic analysis, and implementing a statistical fingerprinting technique. In 
order to accomplish proxy identification, we first need to understand the previous work. 

2.1 Origins and Fundamental Knowledge 

The original use of the Internet was designed for ease of sharing information between remote 
locations [11]. As the Internet spread globally, privacy and security became more desirable for 
online communication, banking, e-commerce, and data storage to name a few. When new 
technology for privacy and security is introduced for public use, there are groups of people who 
want to break those secure methods. The interests for breaking privacy and security protocols 
and programs are not necssarily malicious. Proving that a tool or system is unbreakable is 
valuable information. One such approach to breaking privacy is the ability to fingerprint a target, 
whether that be a person, system, or otherwise, based on an identifiable set of characteristics. 
The set of characteristics in which we are interested are those affiliated with reverse proxies. 
Reverse proxies are commonly associated with web proxies, services that handle incoming 
HTTP requests to a web server. As such, they are connection-oriented services that utilize the 
HTTP and TCP to handle traffic between external web clients and internal data servers. The 
time it takes a web server directly connected to the open Internet to handle HTTP requests 
should be less time than a reverse proxy should take since the reverse proxy has to handle the 
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front end connection from the web client as well as the back end connection the appropriate data 
server. We will describe the various types of proxies and the types of connections that can occur 
to a web server. There are several methods to make web servers more efficient, such as caching. 
If a reverse proxy uses caching, the time to respond should be reduced since the back end data 
server connection would not be required. We look at the types of caching before identifying 
similar network traffic analysis methodologies to break privacy tools. Lastly, we present several 
fingerprinting methodologies. Fingerprinting looks at the characteristics of connections and 
network traffic analysis to help us identify reverse proxies. 

2.1.1 Connection Types 

There are three primary ways that a host can establish a connection via Internet services 
using TCP and HTTP. These three ways include a direct connection, a redirected connection 
which is essentially a sequence of two or more direct connections, or a proxied connection. 



Figure 2.1: Direct Connection - The host and the server communicate directly with each other 



First Query 
Second Query 


Figure 2.2: Redirected Connection - The host connects to Server A, then gets redirected to Server B 

Direct connections are not common in the present Internet [12]. Redirected connections are 
much more common for a variety of reasons. Webpage images, advertisements, or other por¬ 
tions of a website may be hosted on multiple webservers. These webservers are not necessarily 
in the same location. The requesting host is then required to make more than one direct connec¬ 
tion to retrieve the various portions of a single webpage. Though the retrieval may seem like a 
single web request, the host is actually making several connections to gather all the fractional 
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Connection between Host and Proxy 
^ Connection between Proxyand Server 


Figure 2.3: Proxied Connection - The proxy sends and receives trafFic for the host 

parts to make the whole. Anonymity may also be a faetor in eonneetions, whieh is the ease 
when using proxy serviees. Proxied eonneetions ean be established in multiple ways, whieh 
will be eovered in the next seetion. 

2.1.2 Proxies 

A proxy is defined as an agent or authority who is authorized to aet on behalf of someone 
else [13]. In the ease of networks, we use proxy servers. A proxy server is defined as a server 
that aets as an intermediary for requests from elients seeking resourees from other servers [14]. 
There are various types of proxy servers that are used to faeilitate network traffie on the Internet. 
Proxy servers are helpful in providing anonymity to Internet users by handling traffie requests 
as the originating host. Proxy servers ean enable a user to maintain privaey when requesting 
eontent from a server or hide the identity of a network attaeker, shape traffie based on ineom- 
ing and outgoing eonneetion information, and they ean also proteet preeious resourees for a 
eompany or web serviee from external threats. 

There are several types of proxies regularly used on the Internet. The most widely-used type 
is known as a forwarding proxy (Figure 2.4). A forwarding proxy handles the origin host’s 
request by forwarding the request to the end host on behalf of the requesting host. The end host 
sees the original request eoming from the proxy, so the end host responds with the appropriate 
message to the proxy. The proxy then forwards the end host’s response to the originating host. In 
this ease, the originating host maintains anonymity beeause it only eonneets and eommunieates 
with the forwarding proxy server. The forwarding proxy server maintains two eonneetions; one 
eonneetion between itself and the origin host and a seeond eonneetion between itself and the 
destination host, or server. The destination host only sees the proxy as the originating host. 
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and does not know anything about the aetual originating host. A seeondary proxy type is the 
open proxy (Figure 2.5), whieh is merely a forwarding proxy that handles forwarded traffie at a 
middlepoint in the Internet. 



► Network Traffic between Host and Proxy 
Forwarded traffic between Proxy and Server 

Figure 2.4: Forwarding Proxy - Forwards traffic from an internal network request to a destination on 
the Internet 



^ Network Traffic between internet locations 
via a middle proxy server 


Figure 2.5: Open Proxy - forwards traffic from two hosts on the Internet, but maintains no connections 
between the two 

Another common form of proxy server is the reverse proxy (Figure 2.6), which is occasion¬ 
ally referred to as a web cache [15] or an HTTP accelerator [16]. A reverse proxy takes requests 
external to the network it is serving, and forwards the external requests to the appropriate inter¬ 
nal host or server. The external host likely does not know that its request is being transferred 
beyond the reverse proxy server, therefore providing anonymity to the internal network. The 
internal network may not know the external host that submitted the request because it only sees 
the request coming from the reverse proxy server. The reverse proxy server maintains the two 
connections just like the forwarding proxy, but instead of providing anonymity to the request¬ 
ing host, a reverse proxy server provides anonymity to the end host or hosts. The reverse proxy 
server is generally an application that often works closely with the internal network firewall [17] 
so that external requests can only access the reverse proxy address. 
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There are several other forms of proxies, sueh as caehing and DNS proxies to name a few. 
These are more funetionally speeifie proxy servers and are not pivotal to understanding this 
thesis topic. The only item of note is that reverse proxies can have a dual role as a caching 
proxy if it is customized to do so. This would make the connection appear more like a direct 
connection from the origin host’s point of view. 

There are a variety of software solutions to handle proxy requirements. Most web browsers 
have network options to set forwarding proxy addresses in order to route HTTP traffic for client 
workstations. Some internal networks force this requirement to monitor intranet activities by 
forcing all connected workstations to route their traffic through an administratively controlled 
server before reaching the open Internet. This is mostly for handling outgoing traffic. Con¬ 
versely, reverse proxy software solutions are common to handle network loads and protect 
internal network servers. A few common examples of these reverse proxy programs are Ng- 
inx [18], Apache [19], and Squid [20]. Each of these software packages exhibit similar packet 
handling techniques when routing HTTP requests. Malicious users may create their own reverse 
proxy server solutions to reduce their externally visible footprint. 



► Network Traffic requests to Proxy 
— — — — “ — ► Routed trafficto irrtemal network server 

Figure 2.6: Reverse Proxy - Receives Internet requests and forwards the request to the appropriate 
internal server and back to the requesting Internet host 

2.1.3 Web Browsing and Caching 

Web browsing is the most common use of HTTP. The general concept is that an originating 
host, or web client, establishes a connection with an end host, or web server. The web server 
has data content made available via a shared filesystem, typically a web server which allows 
connections via ports 80, 443, or 8080. These ports are the standards, but web servers can be 
configured to accept connections on different ports. Once the two hosts are connected via a 
three-way handshake using TCP packets, the web client submits a request to the web server 
in the form of an HTTP message. This request can be a GET, HEAD, POST, PUT, DEEETE, 
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TRACE, or CONNECT [21]. The most commonly used requests are the GET, POST, and PUT 
requests. Once the web server receives the request, it acknowledges the request and proceeds 
to send TCP packets with data comprised of HTME code as the base object of the request. It 
is common to have additional objects within a single web page such as embedded javascript 
applets, images, or dynamic content such as advertising services. Eor example, a user submits 
a web request using a web browser. Once the HTTP data is received by the client machine, 
the web browser will notice embedded links within the HTME of the web page to various 
elements. Each additional element will initiate another request from the web browser, resulting 
in numerous GET requests originating from the web client machine. The take away here is that 
the HTME content is the base information required to render a web site to a requesting web 
client. The additional elements are usually links to specific pieces of content for the website, 
and are standalone network requests when considering the network traffic from a web client to a 
web server. Sometimes the extra content objects reside on the same web server as the HTME. In 
cases such as this, additional connections are not necessary. The browser window will display 
the content of the data transfer as it becomes available [12]. 

Web browsing requests may be directed to Content Distribution Networks (CDNs) instead 
of the actual web server through Domain Name Service (DNS) request handling. A CDN is 
a large network of servers deployed across multiple networks in several data centers, often 
geographically distributed [22]. CDNs are employed on the Internet to reduce network load 
on the original web server, reduce load on the Internet as a whole, and provide faster response 
times to edge hosts that typically make the web requests. The way that CDNs provide current 
web content is by replicating their customers’ data throughout their CDN data centers [15]. 

CDN distribution nodes keep track of the most current web server content. When an update 
occurs, the distribution node pushes the new material to each of the data centers. These data 
centers may operate behind reverse proxy servers at the Internet edge. In this case, reverse 
proxies would handle the incoming requests and pull the data from the data servers or use the 
cached data at the proxy server. CDN servers rely on edge host DNS requests for web pages 
to serve web content. They intercept web page requests from edge hosts and redirect the edge 
hosts to their CDN data centers. Edge hosts can bypass CDN services by identifying the origin 
web server’s Internet Protocol (IP) address and connecting directly to that address vice using 
the DNS service. 

CDNs are an interesting case of web browsing connections. They are globally positioned to 
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provide data content to end users. Each CDN stores thousands of web page files as a service to 
companies [23] willing to pay for this performance boost. The types of companies that typically 
request CDN support are those that serve thousands or more clients across multiple geographic 
regions. Typical companies include commercial online stores, news agencies, and corporations. 
CDNs must pay Internet service providers to provide Internet connectivity. There are also non¬ 
commercial examples of CDNs. Peer-to-peer CDNs are used in distributing files across the 
Internet. Hosts may "opt-in" to such a CDN in order to download files. The concept is that the 
originating data provider takes less of the network load as participating users increases. Each 
user’s host machine contributes to sharing the file once the file is downloaded. This type of 
CDN is not as pertinent to web services such as HTTP, and will not be discussed further. 

Another way to enhance web browsing performance is through web caching. This is different 
from CDNs in that web content can be cached at the edge host making the request or any server 
along a network path between the web client and the intended destination, the web server. Some 
sources will define a web cache as a form of proxy [15], though web caching can occur at the 
edge host as well. Web caching is much like memory system caching on a computer [16] in 
that the cache stores anticipated requests to speed up response time. Caching at a reverse proxy 
server will affect packet round trip times. If the reverse proxy has the most up-to-date content, 
then it can serve the data immediately instead of handling a connection to the backend server. 
Round trip times would appear to be those of a direct connection instead of a reverse proxy. 

An interesting caching mechanism that closely resembles a reverse proxy is that of the trans¬ 
parent cache [7]. Transparent caches (see Eigure 2.7) intercept port 80 web requests at the 
switch and/or router and serve web content using policy-based redirection to one or more back¬ 
end web cache servers or cache clusters. The main difference between a reverse proxy and a 
transparent cache is that a reverse proxy is software based whereas a transparent cache is im¬ 
plemented through hardware and routing policies. An additional difference is that transparent 
caching can be done throughout a network, and is not solely implemented near the end of the 
network path like a reverse proxy server. Overall, caching is implemented heavily through¬ 
out the Internet. Conversely, we expect that caching on compromised hosts is highly unlikely 
because it would significantly increase the footprint of the malicious user which is deemed 
undesireable. 
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Figure 2.7: (a) Standalone cache (forwarding proxy), (b) Router - Transparent cache, and (c) Switch 
- Transparent cache 

2.2 Related Work 

There are several fields of study that foeus on identifieation, elassifieation, and analysis of 
network characteristies. Fingerprinting is an area of researeh that attempts to mateh an identi¬ 
fiable eharaeteristie to a host, server, or user. Traffie Analysis attempts to understand network 
and user behaviors by examining the movement of data across systems. Timing analysis uti¬ 
lizes network timing data from delays, packet round trip times from host to host, as well as 
clock skews to build a classification mechanism. These methods are reviewed below as related 
work to this study. 

2.2.1 Fingerprinting 

The technique or techniques used to identify unique features of a program, service, or sys¬ 
tem is defined as fingerprinting. Fingerprinting techniques are typically classified as active, 
passive, or semi-passive [24]. Kohno et al. describes these techniques as the following. Passive 
fingerprinting techniques are where the fingerprinter, a measurer, attacker, or observer, is able 
to obsere the traffic from the device, or fingerprintee, that the attacker wishes to fingerprint. 
An active fingerprinting technique is where the fingerprinter must have the ability to initiate 
connections to the fingerprintee. A semi-passive technique is defined as a situation where, af¬ 
ter the fingerprintee initiates a connection, the fingerprinter has the ability to interact with the 
fingerprintee over that connection. An example provided states the fingerprinter as a website 
with which the device is communicating, or is an Internet Service Provider (ISP) in the middle 
capable of modifying packets en route [24]. 

The Internet is a vast network. Researchers will rarely have direct access to perform passive 
or even semi-passive measurements. Because of this, we desire active methodologies that we 
can use to gain knowledge where there previously was none. We achieve this goal through iso- 
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lating unique characteristics of our identified target and formulating algorithmic steps by which 
we fingerprint systems, normally via remote means. There are a variety of known methods 
for fingerprinting systems. Several of the approaches that are pertinent to this study include 
clock skew fingerprinting, website fingerprinting, reverse proxy fingerprinting using the Exten¬ 
sible Markup Language (XML), statistical network traffic fingerprinting of botnets, and timing 
channel fingerprints. 

There are a multitude of ways to identify operating systems remotely. A recent study found 
that system clock skews can be used to fingerprint not only an operating system but uniquely 
identify individual systems with microscopic timing deviations in hardware [24]. An important 
discovery in this research was that computer chips have uniquely identifiable clock skews that 
are discemable from practically anywhere using active measurements over either ICMP or TCP. 
They were able to validate that system clocks and TSopt clocks were highly varied across large 
sets of systems. This is important to note since our active measurements for this thesis will be 
focused on our web client system clock times instead of TSopt values. The time stamps that 
are used in packet captures are pulled from the operating system kernel [25]. Therefore, time 
stamps for incoming and outgoing packets are all based off of the web client’s operating sys¬ 
tem and not exact times in reality. A second important takeaway from this research was that, 
due to the high amount of variability in network delays referenced in Vem Paxon’s paper [26], 
some sort of linear programming or smoothing mathematics are required to apply fingerprinting 
measurements with an acceptable level of accuracy. This thesis research will use a statistical 
threshold to handle the timing fingerprint used against web servers for reverse proxy identifica¬ 
tion. 

There are a few notable techniques for fingerprinting websites based on network traffic anal¬ 
ysis. One such way builds on Hintz’s groundwork which simply counts the downloaded files 
and file sizes for individual websites and compares those two indicators against SafeWeb user 
traffic passively [27]. The second iteration of research performed by Gong et al. [28] actively 
studied spikes in traffic patterns using side channels and the queues on links coming from edge 
network hosts. The authors monitored the encrypted network traffic on the incoming DSL link 
while actively pinging the target host. Based off of the latencies resulting from queueing, they 
were able to fingerprint the target host websites visited by using Time Warping Distance (TWD) 
as a similarity metric (Ligure 2.8 [28]). Graph (a) represents the number of bytes transferred 
over time when requesting the website Yahoo.com. Graph (b) represents the observed RTTs 
from pings sent to the target host. Graph (c) is the processed version of Graph (b), bringing 
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out the distinct spikes in RTT behavior that can then be compared to graph (a) for matching 
target host website activities. The RTTs of their pings indicated differences from the norm, in 
this case queueing, and could be compared to their known data sets of website traffic patterns to 
identify, or fingerprint, the edge host network activity. Likewise, we will use RTT differences 
from our hypothesized expectations to differentiate web server types. 




(b) Observed RTTs 



(c) Processed RTTs 

Figure 2.8: Real traffic on a DSL vs. probe RTTs 

The second methodology for website fingerprinting utilized a multinomial Naive-Bayes clas¬ 
sifier to subvert Privacy Enhancing Technologies (PETs), such as Tor, VPNs, and JonDonym, 
using IP packet frequency distributions [29]. The Naive-Bayes classifier must assume inde¬ 
pendence among the variables, which is undesireable. This type of classifier has a successful 
history in website fingerprinting research. Though the use of the classifier is very different, we 
will employ a Naive-Bayes classifier to identify reverse proxies once we have an established 
timing threshold. Similar to the previous website fingerprinting research, the attacker must 
have a database of known and/or anticipated websites to fingerprint the target user’s traffic. Our 
methodology requires a list of websites that we use to generate traffic captures, but does not re¬ 
quire any former traffic knowledge to fingerprint servers. Once we establish the threshold time 
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expectations for a reverse proxy server versus a typical Internet-facing web server, we expect 
to be able to identify any IP address serving web content as a regular web server or a reverse 
proxy. 

United States Patent Number US 7,272,642 B2 [10] covers the detection of reverse proxies 
and establishing connections to the hidden servers via tunneling by analyzing differences in 
affinity identifiers in network traffic. Affinity identifiers represent specific portions of a physical 
or virtual device. Examples include MAC addresses, IP addresses, virtual network identifiers, 
and so on. The specific affinity identifiers that they reference in their work focus on differences 
between destination IP addresses of active code, or java applets, and the web server. 

Active code segments within a web page are not capable of being cached, yet are linked 
within the web page. The links to the active code segments within a web page are observable, 
and can be compared to the web server address for differences. If it is determined that there 
is a difference in destination IP address, the research presents a methodology to establish a 
tunnel directly to the Webserver by crafting specific HTTP requests utilizing the outlying affinity 
identifier(s), effectively bypassing and fingerprinting any reverse proxy in the process. This 
detection patent will only work for web servers that handle active code segments within the 
web content, though it could be combined with our work to have more thorough fingerprinting 
results. 

Peer-to-Peer botnets are currently a more common architecture because they are harder to 
take down, more resilient to host gain and loss, and stealthier in their activities. Zhang et al. [30] 
describe a methodology in an enterprise network where they monitor all network traffic at edge 
nodes to identify P2P activity. The fingerprinting focus is on the C&C communication patterns. 
They use a phased approach, which is helpful in managing the fingerprinting process. 

The first phase is to reduce the large amounts of captured network traffic to only that which 
could potentially be related to P2P communications. From this reduced network traffic they 
use statistics to isolate flows related to P2P communications and identify P2P clients within the 
network. The second phase they implement a detection system that analyzes traffic generated by 
the previously identified P2P clients and classifies the clients into legitimate P2P clients or bots. 
An interesting theory that they utilized with great effect was that bots within a botnet operate 
the same way throughout the botnet. We could attempt to take this theory for our research, 
assuming that reverse proxy signatures for similar botnets or C&C managers would remain 
consistent, leading to a wider view of reverse proxy operations within a botnet. 
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One of the most interesting fingerprinting studies is the work done by Shebaro et al. [31]. 
They proposed an active method to leave a fingerprint on machines hosting illegal services that, 
in conjunction with timestamps from hidden service log files in those hosting machines, could 
be used to validate illegal activity in a criminal case. The anonymity approach used by Tor host¬ 
ing machines was to have a web server application such as Apache [19] using pseudodomain 
names. The researchers used HTTP GET requests from controlled hosts through the Tor net¬ 
work to a controlled Tor hosting machine. They were able to monitor hidden service logging by 
Apache and Tor as well as control the rate that GET requests were submitted through the Tor 
network. They sent GET requests every 5 seconds over 55 hours to identify delay distributions 
through the Tor network since the Tor routes were changed every 10 minutes. They built Reed- 
Solomon codes that were basically code words for GET request submissions at 1 bit per minute 
for each bit of the code word to the hosting machine. This timestamp of requests, which was 
handled in conjunction with delay distributions to ensure a noticable fingerprint was embedded, 
was identifiable on the hosting machine in the hidden service log files. In our research we do 
not expect to leave any fingerprints of our activity on the target machines. 

2.2.2 Traffic Analysis 

Traffic analysis studies have been performed since the Internet began with the ARPANET at 
the Network Measurement Center of University of California-Eos Angeles [11], [32]. Traffic 
data records the time and duration of communications, and the analysis of this data can be used 
for a variety of purposes. Several of the purposes are important because the timing of network 
traffic affects host and server performance, delays imparted onto all users of a network, as 
well as the quality of some services such as streaming media. We need to understand network 
impacts on packet round trip times such as variance and latencies, packet dynamics, and the 
various ways timing analysis can be affected and used to glean understanding and knowledge. 
Traffic analysis has military and law enforcement roots [32]. It was used to gain intelligence 
against adversaries and criminals. The root goal of traffic analysis is to understand what is going 
on in an area where the amount of data is overwhelming and non-intuitive. We will first look 
at network dynamics that pertain to traffic and timing analysis, then address several important 
analysis research topics that pertain to our current research. 

Vern Paxson did a study on packet dynamics in the late 1990s that underlined the difficul¬ 
ties with larger scale network measurements [33]. He first describes that analyzing packets 
using TCP requires a way to gather timestamp values for the sending and receiving of those 
packets, which TCP does not always provide using TSopt. We rely on packet filtering tools 
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and system clocks to provide timing values for packet activities. These filtering tools, tcpdump 
and wireshark for example, may improperly label times or miss timestamps entirely because 
of "measurement drops." Furthermore, TCP packets can be sent over a short duration for small 
TCP flows or for hours in cases where there are large numbers of packets transferred in large 
files. These irregularities complicate our ability to do correlational and frequency-domain anal¬ 
ysis. This point is addressed in many studies, including this thesis project, by using a number of 
tests deemed statistically sufficient to cover the normal cases with guaranteed outliers. A high 
enough number of test cases can average or smooth out the results in order to do correlational 
and/or frequency-domain analysis. 

Out-of-order packet delivery is prevalent in the Internet due to the close proximity of data 
packet transmission which is affected by queueing rates, route flutter where packet routes 
change over time, and assumed packet loss combined with duplicate packet transmissions. More 
specifically, data packets are more often reordered than acks because they are frequently sent 
closer together [33]. Route flutter is defined as the "fluttering" of network routes changing 
which affects the timing and arrival of packets. Route flutter is assessed as being site depen¬ 
dent. Some websites exhibit higher amounts of route flutter and out-of-order packet arrivals 
than other sites. This point contributes to our reasoning for performing a large number of HTTP 
requests from a single host to a single target which we will highlight in Chapter 3. Though the 
routes may change and timing data may be inconsistent for some flows, the averages should be 
telling. 

Packet loss in the late 1990s was calculated between 2.7% and 5.2%. The larger percent¬ 
age was assessed to be partially due to a bigger transfer window that allowed for more data in 
flight. This makes sense because queues would build up faster and have a higher likelihood for 
dropping packets. Internet bandwidth has increased significantly since this study. As such, it is 
expected that packet loss rates are even lower than reported here. A note of interest regarding 
packet delay is that one-way transit times (OTTs) are often not well approximated by dividing 
RTTs in half, and that variations in the paths’ OTTs are often assymetric [33]. Lastly, packet 
and timing compression can occur because of bottlenecks along transit routes. This could po¬ 
tentially sway timing results with RTTs given a small sample size. As previously mentioned, we 
will perform a large number of probes in order to analyze timing characteristics without being 
blinded by timing or packet compressions. 

Martin et al. [34] examined the different delay times on the Internet focusing mainly on 
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Figure 2.9: Simple HTTP session 
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Figure 2.10: SYN / ACK RTT Figure 2.11: FIN / ACK RTT 


HTTP traffic and server-side delays. Both delays are primary factors in our research. They 
analyze HTTP flows from both directions, identifying message types (SYN, ACK, FIN, etc.) 
and assessing packet timeouts or packet loss. They define a typical HTTP flow (Figure 2.9 [34]), 
and provide examples of RTT values for the typical HTTP flow (Figures 2.10, 2.11, 2.12, 
2.13 [34]). Figure 2.9 [34] does not reflect what is seen in reality. The difference between real 
web requests and those examined in this study, as shown in Figure 2.9 [34], is that real web 
requests ACK the web request first then send follow-on packets with data. The web request 
ACK does not contain any data. 

The authors also acknowledge the variety of RTTs in a simple HTTP flow. The first distinct 
RTT is the three-way handshake (3WHS) which is comprised of a SYN from the client to the 
server followed by a SYN-ACK from the server to the client. The web request from the client 
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Figure 2.12: Data Packet Timing Example 


Figure 2.13: Data Packet RTT 


to the server and the server response ean also be elassified as an RTT. For the purposes of our 
research, we will define this web request RTT in Chapter 3. It will not be consistent with this 
research or most RTT estimation research in that we include multiple packets for identificaiton. 
The estimated percentage of lost and duplicate packets were consistent with Paxson’s findings 
(roughly 2% to 8%) [33]. 

Martin et al. propose to use the minimum RTT as a measure of latency within a connection. 
However, they do recognize that taking a minimum RTT value is very sensitive to errors. If 
the route changes in the middle of a flow, the measurement could be miscalculated. This risk 
was mitigated by executing longer lasting flows in the hours or close to it. A key takeaway in 
queueing delay is that the two directional links in a connection can be unambiguously aligned 
so that no replies occur out of order. For the purposes of network measurement accuracy, a 
few milliseconds of variance in precision is sufficient given that the size of most delays are in 
the tens or hundreds of milliseconds. Lastly, it was concluded that network delays were the 
dominant effect on overall delay within HTTP flows. This is very important for our research 
since we ascertain that the network delays will cause timing values to "give away" the presence 
of a reverse proxy. 

Kam’s algorithm [35] defines how RTTs should be measured when handling packet retrans¬ 
missions for network traffic shaping. Kam’s algorithm is very important to understand because 
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most previous work with RTTs utilize this definition. It states that “When an acknolwedgement 
arrives for a packet that has been sent more than once (i.e., retransmitted at least once), ignore 
any round-trip measurement based on this packet, thus avoiding the retransmission ambiguity 
problem. In addition, the backed-off RTO for this packet is kept for the next packet. Only when 
it (or a succeeding packet) is acknowledged without an intervening retransmission will the RTO 
be recalculated from SRTT” [35]. This definition of RTTs is used in most current traffic flow 
analysis tools such as TCPTrace, tstat, and others. This presents challenges when attempting to 
use current tools for our work. We will have to build our own definition of RTTs not only to 
include RTTs that would previously have been ignored but also to omit some RTTs that would 
not be indicative of a reverse proxy. 

An important study on variability in round trip times [36] within TCP connections showed 
that RTTs within connections vary widely. This impacts our research because we cannot simply 
average round trip times and estimate small differences as valid identification measurements. 
There must be a distinct difference in timing values for the round trip times, and the variability 
in the round trip times must be examined closely to increase classification accuracy. This same 
paper notes that ACK packets may be delayed in some connections to every other packet, which 
is an optional TCP feature [37]. Should this occur, RTT averages for comparison in our study 
would be skewed. Along this same line of thinking, TCP slow start will cause variance with 
packet RTTs since more data will be sent. In addition to using Karn’s algorithm in this RTT 
variability study, Aikat et al. added additional tests to reduce RTT samples even more. We 
will diverge from both of these RTT sampling methodologies in that we primarily care about 
the three-way handshake RTT and the time difference between the client’s GET request and the 
first data packet response from the Webserver. 

Vern Paxson examined the clock skews and clock adjustments from client and server vantage 
points to see how common they were between separate hosts [26]. He discovered that they are 
indeed common, and that using the network time protocol (NTP) does not resolve these errors. 
His discovery suggests that, when measuring packet features and delays from two or more 
endpoints, clocks are rarely ever synchronized. Since they will be off in their timestamps, it is 
erroneous in many cases to correlate client and server clock values in measurement calculations. 
Times from one host should be isolated and addressed only with times from that host. He also 
describes clock offsets. System clocks will be offset to some degree from the time established 
by national standards. He defines skew as the frequency difference between the clock and 
national standards, a simple inverse formula from the offset value. The remaining portion of his 
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work here foeuses on an algorithm for identifying eloek adjustments and the variability in eloek 
eharaeteristies. We will not be "eross-pollinating" our timing values between server network 
traffie samples and elient traffie samples. In faet, our elient traffie samples are our main foeus. 
The majority of our network measurements in this researeh projeet will be solely from the 
elient’s perspeetive. Future work may address multiple vantage points, and for this reason this 
is ineluded as a faetor in measurement ealibrations. 

The final pieee of related work we wish to diseuss revolves around the preeision of times¬ 
tamps for reeorded network traffie [38]. Jorg Mieheel et al. reeognize the differenees in la- 
teney sinee Paxson’s paeket dynamies work in the late 1990s as a growing issue in network 
measurement studies. Smaller lateneies over the Internet mean that the onee aeeeptable lOms 
resolution of system eloeks is no longer an option to get valuable test results. They propose 
the implementation of a hardware-driven timestamp using DAG measurement eards [39]. They 
were able to inerease timestamping preeision by at least an order of magnitude, but software 
solutions remain the norm throughout most of the world. In future experiments, these hardware- 
based eloeks might be imperative to gain valuable timing differenees in higher-speed networks. 
Though the eurrent state of the Internet provides high-speed eonneetions with low lateney val¬ 
ues, the RTT values are not small enough yet to be washed away by typieal system eloek time 
resolutions. 
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CHAPTER 3: 

REVERSE PROXY FINGERPRINTING 
METHODOLOGY 


3.1 Overview 

We first discuss our initial study and analysis of RTT values for a direct HTTP connection 
with a web server as well as a connection through a reverse proxy. This motivates the deeper 
review of packet dynamics within HTTP requests. The differences in TCP flows between stan¬ 
dard web servers and reverse proxy webservers are discussed in detail in order to highlight 
unique characteristics. These characteristics are used in creating an algorithmic approach to 
step through a single TCP flow to identify the values for the 3WHS and GET RTTs. The need 
for a multi-flow algorithm is presented for large-scale data parsing of pcap files. The method 
we use for data collection and reasoning for our choices of web sites upon which we test our 
algorithm is presented to motivate our implementation in the next chapter. 

3.2 Analysis of Basic Concept 

As a precursor to this thesis research we did a preliminary study to determine whether or 
not round trip times could differentiate between direct connections to web servers and from 
those to reverse proxy servers. We built a simple web page that was less than 1000 bytes in 
size in order to be delivered in a single data packet, which is the best case possible since there 
is no server-side processing. We hosted this website through the GoDaddy webservices [40] at 
http: //WWW. 4550testing. com/, which hosts all websites on an internal network unless the 
website has signed up for a dedicated IP address. In the first phase of our experiment we scripted 
HTTP requests and network traffic captures from the client machine located in Monterey, CA, 
directly to the hosted website whose server was in New Jersey. In the second phase of the 
experiment we ran the same script but sent the requests to a reverse proxy server built off of 
Apache [19] located in Monterey, CA, which handled backend traffic to our hosted website. 

Since this was our first attempt in identifying RTTs, we designed a method to give us some 
control in the packet transfer between the client and the server in order to identify the GET 
request. We built a simple webpage in html that was smaller than the maximum segment size 
(MSS) for a single TCP packet. Using this data size gave us the ability to see the entirety 
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Figure 3.1: Filtering Method to Identify Figure 3.2: Filtering Method to Identify GET RTT - 
GET RTT - Direct Connection Reverse Proxy 


of the webpage delivery in a single packet once the GET request was acknowledged by the 
server. Our methodology for pulling the GET RTT was to save the packet data in text format 
using Wireshark’s [41] command-line version known as Tshark. We then used linux regular 
expression tools such as sed, awk, and grep to look for the HTTP GET packet and the HTTP/1.1 
200 OK packet. Since the webpage was delivered in a single packet, we could use these filters 
to find the timestamps to calculate the GET request RTT (Eigures 3.1 and 3.2) where tj and 
t 2 denote the 3WHS and GET RTTs respectively. We did not know that there was an ACK 
packet following the GET request during this experiment, but identified the GET RTT based 
on our expectation that the GET response packet would contain data and the HTTP 200 OK 
message. Eigures 3.1 and 3.2 reflect actual packet transitions so that the reader may understand 
why our experiment approach worked and to motivate our GET RTT analysis in a later section. 
This method was sufficient for our small, controlled experiment. We knew we would need to 
adapt our methodology when experimenting with real world web sites because of variable-sized 
data transfers, interacting with different network infrastructure, variable distances to proxied 
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webservers, and so forth. 


We inspected all of the network traffic capture files for the direct connections (Figure 3.3) 
as well as those for the reverse proxy (Figure 3.4). In the direct connections, the 3WHS and 
data RTFs were nearly identical. However, the reverse proxy RTFs were significantly different 
as we expected. Fhe additional time for the reverse proxy to establish a backend connection to 
the content-hosting server added extra latency for the data packets. Fhe results from this simple 
experiment strengthen our hypothesis that it may be feasible to identify reverse proxy servers 
with a FCP timing analysis. 


♦ HFML_GET_200_OK 
■ SYN_SYN/ACK 

- HFML_GET_200_OK (MEAN) 

- SYN_SYN/ACK (MEAN) 


30 


Figure 3.3: Direct Connection RTTs 

3.3 Analysis of Packet Dynamics within HTTP Requests 

We need to understand how packets are exchanged in typical FCP flows in order to identify 
proper RFF values. We define a flow as the connection setup, data transfer, and connection 
termination. Our first examination is to look at packet behaviors within direct connections. 
Following these tests, we will implement the same requests but perform them through reverse 
proxy services. Fo ensure consistency across reverse proxy services, we run the same tests 
against Apache version 2.2.22 (Ubuntu), Squid version 3.1.20, and nginX version 1.2.1 individ¬ 
ually. We controlled the client and the reverse proxy, so we were able to capture the network 
traffic from both vantage points. 
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♦ HTML_GET_200_OK 
■ SYN_SYN/ACK 

- HTML_GET_200_OK (MEAN) 

- SYN_SYN/ACK (MEAN) 


Figure 3.4: Reverse Proxy Connection RTTs 


3.3.1 Base Cases 

The basic direct connection test was performed using Apache web server configured as a 
regular Webserver (Figure 3.5). The client workstation used Mozilla Firefox as the web browser. 
The client made a direct connection using the Webserver’s IP address and the resulting TCP 
traffic was collected at the client and reverse proxy server vantage points. We looked at the 
traffic generated from both vantage points to look for differences. 

We expect that the GET request following the 3WHS will be replied to with an ACK and 
data in a single packet, but this is not the case. The GET request receives an ACK response 
with no data, followed shortly thereafter with a data packet. TCP slow start begins thereafter. 
The 3WHS is denoted as ti and t 2 is the time between the HTTP GET request and the first data 
packet. We refer to the ACK TCP packet to the HTTP GET request as the GET ACK packet. 
The GET ACK is a function of TCP that requires acknowledgement of packets that contain data, 
which in this case contains the HTTP request in the TCP data section [42] from the client to the 
server. We define the GET RTT as the time between the HTTP GET request and the first packet 
that contains data. 

We used the same setup for the basic reverse proxy connection test (Eigure 3.6) except that 
we configured Apache to operate as a reverse proxy to GoDaddy [40]. We capture the network 
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data from both the client and reverse proxy vantage points, similarly comparing the data as 
we did in the basic direct connection. The 3WHS RTT and time between the HTTP GET 
request and the GET ACK packet were nearly the same RTTs as those we observed in the direct 
connection test. The difference in time for the first data packet to arrive is significantly different. 
The reverse proxy has to establish a connection with the back end server (GoDaddy [40]) before 
it can provide data to the client. Production reverse proxy servers may maintain connections 
with back end servers, negating the requirement for a second 3WHS. The data in production 
servers still has to be processed and submitted, however, so a larger RTT for the GET RTT will 
still exist. 

Once several data packets are received by the reverse proxy, the data is transferred to the 
client. This back and forth process continues similarly to the direct connection network traffic. 
The client network capture shows no noticeable differences between the reverse proxy and direct 
connection except for the drastic increase in t 2 . We draw the timing diagrams in Eigs 3.5 and 3.6 
based on the network traffic we observe in both pcap files to highlight the timing difference. 
Though not highlighted in Eigures 3.5 and 3.6 the connection tear down has the same behavior 
as the 3WHS. The difference between Eigures 3.5 and 3.6 and Eigures 3.1 and 3.2 is that we 
recognized the HTTP GET request ACK packet instead of focusing on HTTP messages. 


3.3.2 Test Case Examples 

Referring to Eigure 3.5, we can see that the initial packet exchange is as expected for the 
three-way handshake. We conduct the same direct connection and reverse proxy tests using two 
separate web browsers, an HTTP request tool, and three web server programs. 

Using Apache as a reverse proxy as we did in the base case shows identical network traf¬ 
fic results between Internet Explorer, Mozilla Eirefox, and WGET (Eigure 3.7) because TCP 
handling in the kernel is common to all three. The time to establish a 3WHS(ti) was only 2.2 
percent of the GET RTT (t 2 ). The time between the HTTP GET request and the GET ACK 
packet was 0.018131 seconds, which is a similar RTT to the 3WHS RTT and is calculated using 
Kam’s algorithm [35]. The 3WHS RTT was roughly 45 percent of the time of the GET RTT as 
defined by Kam’s algorithm. This difference is minimal. When we look at the time required for 
the first data packet to arrive, we can easily see the variation in timing between the RTTs which 
is indicative of the reverse proxy. 

NginX and Squid had similar behaviors. They were handled differently by the back end 
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Figure 3.5: Direct Connection Figure 3.6: Reverse Proxy Connection RTTs 

RTTs 

servers to which they were connecting when compared to Apache. Regardless of the browser, 
both servers would establish the connection via a 3WHS, ACK the GET request, then attempt 
to connect on the back end to the website. The website would receive the reverse proxy GET 
request, and occasionally send a 302 message redirecting the client to the same server but a 
different resource, such as a PHP webpage. The redirection would bypass the nginX and/or 
Squid reverse proxy servers. An HTTP 302 message is defined as a temporarily relocated 
resource message that redirects the client to another URE. If the client wants to return to the 
same website in the future, it should visit the same website it tried in the beginning [21]. 

We tested both of the servers against multiple websites. The website seemed to be the de- 
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Figure 3.7: Reverse Proxy Timing Diagram from the Client's Perspective using Apache 


termining factor as to whether or not a 302 message would be returned through the reverse 
proxy. For example, Amazon would not submit 302 messages while the United States Naval 
Aeademy’s website would. The 302 message would get sent to the elient via the reverse proxy, 
leading to a conneetion termination between the client and the reverse proxy. The client would 
then make a second connection to the newly specified URL provided within the 302 message 
(Figure 3.8). Regardless of the response from the baek end server, the time required for t 2 is 
notieably larger than tj. This eharaeteristic is essential in identifying reverse proxies, which we 
diseuss in the next seetion. 
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Figure 3.8: Reverse Proxy Timing Diagram from the Client’s Perspective using nginX and Squid 
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3.4 Algorithmic Approach 

Manual inspection of several flows revealed that the 3WHS is within a few standard devi¬ 
ations when compared to direct connections as well as reverse proxies. We also see similar 
packet flow between the same two types of connections. The main difference between a direct 
connection and a reverse proxy connection is that the time delta between the HTTP GET re¬ 
quest and the first data packet from the server is significantly greater for a reverse proxy. RTTs 
are typically calculated using Karn’s algorithm [35], which only takes into consideration the 
order that packets are sent and received. Specifically, “When an acknowledgement arrives for 
a packet that has been sent more than once (i.e., retransmitted at least once), ignore any round- 
trip measurement based on this packet, thus avoiding the retransmission ambiguity problem. 
In addition, the backed-off RTO for this packet is kept for the next packet. Only when it (or 
a succeeding packet) is acknowledged without an intervening retransmission will the RTO be 
recalculated from SRTT” [35]. RFCs 1185 [43] and 1072 [44] cover the current approaches to 
handling TCP ordering given the high bandwidth connections of present times.The challenge 
this presents is that our GET RTT will be approximately the same as the 3WHS RTT or smaller 
regardless of the TCP connection type. The HTTP GET request ACK includes no data, so the 
reverse proxy has no requirement to connect to the back end web server. 

3.4.1 RTT Analysis Tools 

Most TCP flow tools that provide RTT analysis use Karn’s algorithm [35] to identify RTTs. 
We initally expected that one of these tools would be able to automate RTT collection for us. 
We used TCPTrace [45] in an initial experiment to build a database of RTT measurements 
across thousands of flows. TCPTrace provides a lot of interesting RTT characteristics such as 
variance, averages, client to server RTTs, and server to client RTTs. One observation we had 
using TCPTrace is that RTT values are calculated from the server to client. The server to client 
RTT is calculated as the time elapsed from when an incoming packet is received by the web 
client to the time the response packet is sent from the web client. This time is solely based 
on processing delay at the client. The server to client RTT calculations would be most helpful 
given network traffic capture from a vantage point between the client and server, but our study is 
focused on client vantage points. RTT times are generally associated as elapsed time for packets 
in transit due to queueing and propagation delay. Additionally, TCPTrace rounds times to the 
tenths of milliseconds, which is not a high enough resolution to capture many time stamps based 
on typical Internet speeds. We also experimented with Tstat [46], which yielded similar RTT 
results to TCPTrace, and Captcp [47] which was not as helpful as the former two. Each tool we 
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tested had an RTT estimation process that did not suit our needs. We decided to implement our 
own approach instead of using a common tool. 

Our definition for the 3WHS remains the same as Karn’s algorithm. The 3WHS is defined 
as the time from when a SYN packet is sent from a web client to the time a SYN-ACK packet is 
received by the web client. This RTT will typically be dominated by propagation delay between 
the web client and the web server except in some cases where high server load results in large 
queueing and/or processing delays. We discard Kam’s RTT definition for GET RTT calculation 
in order to include the first data packet from the server on the wire. Our definition for GET RTTs 
is the time when a GET request is sent from the web client to the time that the first packet that 
includes data is received (See Eigures 3.5 and 3.6). This definition disregards the GET request 
ACK packet in order to include the time delay for data transfer. The data packet does not have 
to have the correct sequence number because we only care about receiving a data packet. Data 
packets require a connection to the back end server for a reverse proxy so out of order data 
packets are sufficient. 

The GET RTT results for direct connections are relatively equivalent to their corresponding 
3WHS RTTs. Non-production reverse proxies, such as those used by smaller websites and likely 
malicious websites, must establish a new connection to the backend server and retrieve some 
number of data packets before data can be transferred to the web client. It is also important to 
recognize that many webpages contain embedded links to data such as images, advertisements, 
tables, and so forth within the HTME code. Embedded links may cause multiple GET requests 
to occur within a single TCP flow. If multiple GET requests occur within a TCP flow, we apply 
the same GET RTT definition and count the GET RTT result as additional validation for our 
timing analysis. 

Similar to our approach for test cases, we wanted to use a bottom-up approach to identifying 
RTTs. We developed a method to obtain the 3WHS and GET RTTs based on packet character¬ 
istics in a single flow pcap file so that we could do manual inspection on our results. Once we 
were able to verify our method worked, we adapted the method to handle multiple flows within 
a single pcap file. 

3.4.2 Single Flow Analysis 

In order to identify the 3WHS RTT and the HTTP GET request RTTs, we present pseu¬ 
docode to step through a pcap file of our own generated network traffic and identify the packet 
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features that give us the timestamps we need. The first example is shown in Algorithm 1, where 
we step through a single TCP flow using states and packet directions between the web client 
and the web server. Since a single flow may have multiple GET requests, we keep track of each 
GET in an ordered map of GET requests within the map h. 

We open a pcap file p, read a single packet at a time, and store the packet data into buf 
and the packet time stamp into ts. We use a finite state machine (Eigure 3.9) to keep track 
of the flow’s current state. Each complete TCP flow establishes a connection, transfers some 
data, and closes the connection. We use this knowledge to establish states. We instantiate a 
new flow by setting the state to START. Eirst, we read a packet from the pcap file p. Since 
we are looking at HTTP flows we use a destination port of 80, the established port for HTTP 
traffic [48], to determine if a packet is inbound or outbound from the web client. If the packet 
is outbound and the flow is in the START state, we save the time stamp for the first part of the 
3WHS_RTT and move in our state diagram to SYN_SENT. We read in a packet, looking for 
the SYN and ACK flags to be set in the TCP header. If we meet these prerequisites, we set the 
flow state to SYN_ACK, subtract the first time stamp from the current timestamp ts to calculate 
the 3WHS_RTT, and save the 3WHS_RTT into h, using the clientPort as the key in h. We use 
client Port as the key because (1) ports are rarely repeated in close proximity given that there 
are tens of thousands of allowable port numbers per client and (2) we will limit the number of 
TCP connections per pcap file p to less than the maximum allowable number of TCP ports. 



Figure 3.9: FSM for parsing TCP flows with HTTP 


The flow has been saved into h at this point. Some flows may not contain any GET requests 
because of failed connections, time outs, or legitimate TCP connection closed procedures with 
EIN-ACK packet exchange. The state may finish at this point in the state diagram. However, 
if we read an outbound packet with the PSH flag set we set the flow state to GET_SENT. The 
PSH flag is used to to notify TCP to send data immediately from the server side and to push 
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data to the receiving application immediately from the client side [42]. In this case, the web 
client used a TCP packet with the PSH flag set in order to send the GET request. We then save 
timestamp ts as the first marker for a GET_RTT. 

If we are in the state GET_SENT and we read an inbound packet that has no data, we 
identify this as the GET ACK packet. This packet would typically resemble the completion of 
an RTT, as we mentioned earlier in this chapter. We disregard this packet because it contains 
0 bytes of data, and therefore has not yet communicated with the data server if it is a reverse 
proxy. We read in another packet and check for some amount of data greater than 0 bytes. If we 
find an inbound packet that contains data, we calculate the GET_RTT by subtracting timestamp 
ts from our saved timestamp, add the GET_RTT to an array of GET RTTs in our map h using 
clientPort again as the key, and move our flow state back to SYN_ACK. Moving our flow state 
back to SYN_ACK allows us to test for more GET RTTs. If we identify more GET RTTs within 
the single flow in p, they are appended to h using the clientPort. If we read through p do not see 
any more GET RTTs, the flow is already saved in h and we move to the finished state. We may 
see anywhere from 0 to n GET RTTs for a single flow. This completes a single flow analysis. 

We know that multiple flows will be in a single pcap file p because of how we planned to 
collect our data. HTTP GET requests are likely to have one or more redirection, as discussed by 
Nolan et al [12]. A redirected connection will initiate a new TCP connection, leading to multiple 
flows in what was formerly considered a single flow. We do not want to follow redirections to 
website addresses outside of our targeted website lists for our current research, but may inves¬ 
tigate this further in future work. Eurthermore, embedded links may not be stored on the same 
server as the initial HTME which will cause even more TCP connections to be established. We 
recognize that we must adapt our algorithm to handle multiple flows with interleaving packets. 

3.4.3 Multi-flow Analysis 

We will certainly have more than one TCP flow per pcap file because of our data collection 
methodology. We account for this inefficiency by adapting Algorithm 1 to handle multiple flows 
with interleaving packets in a pcap file. We encompass all of the pseudocode in Algorithm 2, 
adding map g to handle flow State for each flow, filtered by clientPort, and adding map i for 
tracking flow timestamps. We continue to identify packets as inbound or outbound based on 
port 80. If we read a packet as outbound, we also check if the clientPort flow exists in g. If 
it does not exist, we instantiate a new flow in g using clientPort as the key, setting that flow’s 
state to START. This allows multiple flows to be tracked based on their individual states. 
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We modify our state transitions to look at flow state within g using clientPort to keep track. 
Our FSM remains the same as it did in Algorithm 1 except that we store time stamps in i using 
clientPort as the key. The i map allows us to track the most recent time stamp for each flow 
based on clientPort. We update i with the packet timestamp ts when we identify a stored flow 
in our FSM. The rest of the algorithm is carried forward in the same fashion, so we omit the 
remainder for brevity. 

We implemented Algorithm 2 into a program and ran it against several pcap files that we 
created by generating our own network traffic to various websites. We were able to verify the 
results by manually inspecting the tested pcap files using Wireshark [41]. We used a calculator 
and the timestamps from Wireshark to calculate RTT values using the same approach as de¬ 
scribed in Algorithm 2 for our manual inspection and verification. After numerous test cases 
were examined of varying size from single flow pcaps to multi-flow pcaps, we began probing 
lists of websites and collecting network traffic to analyze our methodology. 


3.5 Data Collection 

The data we wanted to collect needed to be close to an equal distribution of popular websites, 
known malicious websites, and known reverse proxy servers for comparitive analysis. Further¬ 
more, we wanted to have a variety of vantage points in order to observe whether or not our 
results were persistent globally. We built a simple script that saves the network traffic generated 
from each vantage point which we are able to analyze later. 

3.5.1 Vantage Points and Network Traffic Generation 

We used 10 PlanetLab [49] vantage points to collect our network traffic. These nodes were 
located in Germany, the United Kingdom, Thailand, China, Japan, and five locations within the 
United States (Albany, GA, St. Louis, MO, Berkeley, CA, Providence, RI, and Madison, WI). 
We used nodes that were globally distributed to see if the timing results showed consistent dif¬ 
ferentiation between the 3WHS and GET RTTs regardless of location. Instead of using domain 
names for our HTTP requests, which could be influenced by DNS lookups from each vantage 
point, we used bulk domain name service lookup software to identify domain IP addresses first 
then execute our network traffic based off of those IP addresses. DNS lookups return differ¬ 
ent results based on location. We kept the location consistent by resolving IP addresses from 
Monterey, CA. 
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3.5.2 Alexa Website Lists 

We want to look for differences in timing results across several layers of popular websites. 
We use Alexa [50] to pick these websites. Alexa is a service provided by Amazon [51] that 
calculates the top one million most frequently visited websites around the world. We chose to 
analyze the top 80 websites from Alexa which did not resolve to the same IP address. These 
websites are generally large companies, and employ IT personnel to manage their servers. This 
is important because though they may employ reverse proxy servers, the response times within 
their intranets or server farms will likely be incredibly fast. As a result, it will be more difficult 
to identify reverse proxies for these websites when compared to the other selected website lists. 

Secondly, we used a random number generator to draw Alexa ranks from the uniform dis¬ 
tribution of Alexa’s 5k to 10k websites for our semi-popular website list. We expect that these 
websites may show lower performance with response times. The types of websites we see in 
this range include universities, smaller companies, and foreign websites that may be limited in 
scope to users within their country of origin. Lastly, we use the same random number generator 
to collect 100 numbers between 500,000 and 1,000,000 to get some websites in the top 0.5% 
of all websites. There are roughly 700 million websites on the Internet, but only around 200 
million websites are active. We expect that this bottom tier of websites may show similar results 
to our malicious website list. Several websites look like the type of URLs one would see in a 
spam list, such as drugadmin. com, rxglobalmeds.com, or cheaprolexsale .biz. Further¬ 
more, some websites may be used for botnet command and control servers using HTTP as the 
communication channel, possibly even reverse proxies. 

3.5.3 Malicious Website List 

We needed to have a collection of known malicious websites. We searched the Internet for 
black lists or malicious domain lists, and found an IP address listing of malicious domains at 
http: //www .malwaredomainlist. com. This list included over one thousand different mali¬ 
cious domains, and was updated regularly. We chose to run our network traffic generator against 
460 of these domains because of time constraints on collecting the data. We chose to limit the 
number of retries per HTTP request to 1, and to limit the timeout value for each request to 30 
seconds. We wnated to limit the amount required for each timeout without cancelling longer 
delayed connections. Websites that no longer existed would create large delays for our col¬ 
lection script, which is where the bulk of our time constraints originated since we collected 
sequentially. Over one hundred of the domains were no longer active, so no connections were 
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established. An estimated 300 malicious domains allowed connections for us to examine. The 
malware domain list keeps an up-to-date record of each website by domain and IP address with 
a description of why it is classified as malicious. Some of the examples include known web 
browser exploits, trojan horses, or ZIP files that have embedded exploits. If we are able to 
identify any of these malicious websites as reverse proxies, the background information for the 
targeted website may be helpful in grouping similarities in reverse proxy servers. 

3.5.4 Known Reverse Proxy List 

We used a reverse proxy server that we controlled locally in conjunction with 10 known re¬ 
verse proxy servers. Our personally controlled reverse proxy server was built using Apache [19], 
with the back end target of GoDaddy [40] that was located across the country from our reverse 
proxy location. A typical reverse proxy will be in relatively close proximity to its internal data 
servers. We wanted to have a large delay to help with identification thresholds which we discuss 
in the next chapter. The additional 10 reverse proxy servers were gathered using Torrent proxy 
servers based in Europe. 
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Algorithm 1 Single Flow Collection with Round Trip Times Using State and Packet Direction 
1: Open pc ap file p; 

2: Create map h; 

3: Initialize State to START ; 

4: for each packet in p, read into buf along with timestamp ts do 
5: if (packet is not IP) then 

6: continue; 

7: Read the packet’s Ethernet and IP headers; 

8; if (protocol number is not TCP) then 
9; continue; 

10: Read TCP header; 

11: if (dstPort is equal to 80) then 

12: Direction is Outbound', 

13: else 

14: Direction is Inbound', 

15: if {Direction is Outbound and State is START) then 

16: if (SYN flag is not set) then 

17: continue; 

18: State is set to SYN_SENT ; 

19: firstTS = ts; 

20: continue; 

21: if {Direction is Inbound and State is SYN_SENT) then 

22: if (SYN and ACK flags are not set) then 

23: Packets are out of order, flow will not process; 

24: State is set to SYN_ACK; 

25: 3WHS_RTT = ts - firstTS; 

26: h.add{clientPort, [clientIP,serverIP,3WHS_RTT,0]); 

27: continue; 

28: if {Direction is Outbound and State is SYN_ACK and PSH flag is set) then 

29: State is set to GET_SENT ; 

30: firstTS = ts; 

31: continue; 

32: if {Direction is Inbound, State is GET_SENT , and data > 0) then 

33: State is set to SYN_ACK; 

34: GET_RTT = ts - firstTS', 

35: if {h.find{clientIP) returns true) then 

36: h.update{clientPort, [clientIP, serverIP, 3WHS_RTT, GET_RTT])', 

37: else 

38: The flow does not exist... error; 

39: continue; 
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Algorithm 2 Multiple Flow Collection with RTTs Using State and Packet Direction 
1: Open pc ap file p; 

2: Create map h for tracking flow RTT 5 ; 

3: Create map g for tracking flow State', 

4: Create map i for tracking flow timestamps', 

5: for each packet in p, read into buf along with timestamp ts do 
6: if (packet is not IP) then 

7: continue; 

8; Read the packet’s Ethernet and IP headers; 

9; if (protocol number is not TCP) then 
10: continue; 

11: Read TCP header; 

12: if (dstPort is equal to 80) then 

13: Direction is Outbound', 

14: if {g.fmdiiclientPort) returns false) then 

15: giclientPort] = START ; 

16: else 

17: Direction is Inbound', 

18: if {Direction is Outbound and g[clientPort] = START) then 

19: if (SYN flag is not set) then 

20: continue; 

21: giclientPort] = SYN_SENT ; 

22: i[clientPort] = ts; 

23: continue; 

24: if {Direction is Inbound and giclientPort] = SYN_SENT) then 

25: if (SYN and ACK flags are not set) then 

26: Packets are out of order, flow will not process; 

27: glclientIP] = SYN_ACK; 

28: 3WHS_RTT = ts - ilclientPort]; 

29: h.add{clientPort, lclientIP,serverIP,3WHS_RTT,0]); 

30: continue; 

31: 

32: Icontinue similarly to Algorithm 1]; 

33: 
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CHAPTER 4: 

RESULTS AND ANALYSIS 


4.1 Overview 

We collected 11 gigabytes of network traffic captures that were grouped into five sets. Three 
of the sets were selected from Alexa’s top one million websites. We selected 80 websites from 
the top 100, 100 websites from the 5,000 to 10,000 range, and 100 websites from the 500,000 
to 1 million range. We also had a group of known malicious websites and a list of reverse 
proxy websites, including one that we controlled. The following sections will describe how we 
collected the data, parsed all of the network traffic data into readable and analyzable formats, 
and then present the results and analysis to verify our hypotheses. 

4.2 Data Collection, Parsing, and Organization 

Once we finished collecting network traffic data, we knew we needed to have a concise ap¬ 
proach to storing flow results. The obvious requirements for storing flow data included the 
3WHS and GET RTTs along with a minimum of one data point to differentiate between flows. 
The differentiating value follows from our algorithm, using the client port to distinguish be¬ 
tween flows. When we write hundreds or thousands of flow results into a single data file, we 
need more unique flow factors for analysis than the three aforementioned features. We added 
the source and destination IP addresses so we could see from which PlanetLab node the traffic 
originated and the targeted website for each flow. In the cases where a flow had more than a 
single GET RTT because the same TCP connection was used to retrieve multiple objects, we 
kept a counter for the number of GET RTTs so that we could average the GET RTT time for 
a single flow. We used this average GET RTT for calculating flow RTT ratios, which we will 
discuss further in this section. 

4.2.1 Data Collection 

We used a bash script to collect our network traffic data from each PlanetEab Node. The 
script would read in IP addresses from a file passed in as an argument. The IP addresses were 
targeted by our script one at a time, sequentially executing the linux wget tool 250 times, then 
saving the network traffic for a single website per node for analysis. We used tcpdump to 
collect the network traffic. Each pcap file contained the network traffic from a single PlanetEab 
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node to a single website. We could then take each pcap file and analyze the results as an 
individual case. Furthermore, we could store the pcap files into groups depending on the filter 
we wanted to use. The types of filters that we may want to use to organize our data could be 
by PlanetLab node, website group, or website. We chose to break our data into website groups 
and PlanetLab Nodes. Overall, this resulted in 950,000 network traffic captures for the Alexa 
websites, 750,000 captures for the malicious websites, and 27,500 captures for known reverse 
proxies between June and September 2013. 

4.2.2 Reading Traffic Captures into Table Entries 

We implemented Algorithm 2 into a C++ program that analyzes a single pcap file and writes 
flow data to a file using comma-separated values (CSV) format. Each line in the CSV file 
represents a single flow. An example of this can be seen in Table 4.1, where the total row count 
equals the number of flows for that data file. We use a bash script to automate our C++ program 
across all pcap files from a PlanetLab Node within a single file directory. In this way, we can 
create a single data table for a pcap file that contains all flow results within that pcap. The way 
that we collected network traffic was between a single host and destination for each pcap file, 
initiating 250 web requests to each destination. 


Table 4.1: Data Storage for Network Traffic Capture Results - Flow per Row 


Client Port 

Client IP Address 

Server IP Address 

3WHS RTT 

# of GET RTTs 

GET RTT[I] 

44034 

169.226.40.2 

74.113.233.48 

0.015901 

1 

0.016551 

44037 

169.226.40.2 

74.113.233.48 

0.018353 

1 

0.016233 

43783 

169.226.40.2 

74.113.233.48 

0.015374 

1 

0.016882 








We only captured a maximum of 249 flows per pcap file because our network capturing pro¬ 
gram, tcpdump, took more time to begin collecting packets than we allowed before starting our 
web requests. The single missed flow is negligible. Some target websites showed connection 
challenges. We use our malicious website list and the Albany PlanetLab node as an example 
(Table 4.2). Only 221 flows were collected out of the 249 expected because a connection could 
not be established. Additionally, we found that data retrieval occasionally fails even though the 
initial connection is successful. The malicious website set showed a higher percentage of failed 
connections as well as failed GET requests when compared to the Alexa website lists. All sets 
typically had 1 to 5 failed GET requests per 250 requests. Manual inspection shows that the 
failed GET request flows had packet transfers as depicted in Eigure 4.1. 

There were several thousand CSV data files once we finished running our script and C++ 
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Table 4.2: Albany PlanetLab Node Traffic to Malicious Website: 178.73.224.97 


Client Port 

Client IP Address 

Server IP Address 

3WHS RTT 

# of GET RTTs 

GET RTT[1] 

58688 

169.226.40.2 

178.73.224.97 

0.105189 

1 

0.304336 

58944 

169.226.40.2 

178.73.224.97 

0.105583 

0 

- 

54337 

169.226.40.2 

178.73.224.97 

0.105192 

1 

0.303039 


program against all of our pcap files. These files provided a raw look at the paeket timing 
between a elient and server whieh we eould aggregate to get finer grain detail for analysis. In 
partieular, we wanted to be able to look at the eumulative distribution of 3WHS to GET RTTs 
from a elient to a server as a single data entry. Our solution was to use a ratio of the GET RTT 
to the 3WHS RTT. 


Client Server 



Figure 4.1: Packet Timing Diagram for Failed GET Requests 

4.2.3 Building Flow Sets 

We obtained the average GET RTT by using the number of GET RTTs eolumn eombined 
with the GET RTT timing values in the CSV data file. We divide the average GET RTT by 
the 3WHS RTT to have a single flow RTT ratio. This ratio should be roughly 1:1 for a direet 
eonneetion as we diseussed in Chapter 3. The expeeted ratio for a reverse proxy server would 
be greater than 1:1, but we had to analyze the known reverse proxy results in order to establish a 
threshold. Tables 4.3 and 4.4 show three separate entries in our summarized data file. A single 
entry has the identifying flow information with the elient and server IP addresses as well as the 
server domain name, if any, in the first three eolumns. If the server does not have a domain 
name, as is the ease for most of our malieious websites, it is labed as an ’undefined domain’. 

The fourth and fifth eolumns traek the total number of flows identified and the fraetion of 
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total flows that contained GET RTTs. This requires some explanation. We knew that 250 
web requests were initiated for each pcap file. If we identified only 240 flows that established 
connections, we would put 240 in the fourth column. We would take the total number of flows 
that had GET RTTs and divide that by the number in column four to get the fraction in column 
five. This gave us an idea of how many connections were actually established between the client 
and server, as well as the those connections that were able to successfully retrieve data with the 
server. 


Table 4.3: Data Storage for Cumulative Distribution of Flow Ratios for a 
Single Vantage point: Columns 1-5 


Client 

Server 

Server 

Total Number of 

Fraction of Total Requests 

IP Address 

IP Address 

Domain Name 

Collected Flows 

with a GET RTT 

169.226.40.2 

101.1.18.24 

uedl23.cn 

249 

1.0 

169.226.40.2 

108.162.198.182 

professionalcrew.net 

249 

1.0 

169.226.40.2 

109.69.186.62 

monsieur-meuble .com 

249 

1.0 


Table 4.4: Data Storage for Cumulative Distribution of Flow Ratios for a 
Single Vantage point: Columns 6-11 


Average Ratio 

Maximum Ratio 

lOth Percentile 

25th Percentile 

75th Percentile 

90th Percentile 

1.00850709519 

2.0050137043 

1.0565533638 

1.02593505383 

1.00303089619 

0.988574624062 

1.11531659039 

2.88584566116 

1.19612193108 

1.11458528042 

1.06560087204 

1.03861892223 

1.08726804611 

13.0100908279 

1.0081192255 

1.00470817089 

1.00184178352 

0.999478459358 


The sixth column is calculated by averaging all of the RTT ratios for the pcap file. The 
maximum ratio is identified by sorting all of the ratios in ascending order in an array and taking 
the value in the largest array index. The eighth through eleventh columns are important for our 
flow analysis in the next section. We take our sorted ratios array and identify the index in the 
array that correlates with the percentile we want. The percentile values allow us to analyze the 
ratio spread across all requests to a given website from a single client. 
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Our flow analysis is broken into two sections. The first section will cover our findings look¬ 
ing at the website sets holistically. We will show how the results match up with our hypotheses 
in Chapter 3 regarding set correlation or divergence. The second section will be the analysis for 
our reverse proxy threshold and the statistics for websites that match this threshold. Lastly, we 
will cover several detailed case studies on websites that show interesting signatures. 

4.3 Website Group Results 

In this section we present the results and analysis for each website group holistically. We 
start with the Alexa website groups, leading off with Alexa’s top 100 websites. The final group 
we analyze is the malicious websites group. The results for this section show us a high-level 
view for timing ratio results across all website groups, and allow us to identify the percentages 
of websites within each group that may be identifiable as a reverse proxy. The percentile results 
for each website group provide a filtered estimation for timing ratios that we can use to compare 
with our known reverse proxy results in the next section. Overall, the vantage points were 
consistent for every group which carried through to the case studies. A single outlier existed 
with Singapore’s PlanetLab node results against f eedblitz. com which we covered in a case 
study. 

4.3.1 Alexa’s Top 100 Websites 

The Alexa Top 100 websites include major search engines (i.e., Google, Yahoo, Baidu), 
online commercial companies (i.e., Amazon, eBay, Apple), and social networking or blogging 
sites (i.e., Blogspot, Facebook, Twitter). Over 20 of the top 100 websites resolve to the same 
IP address, so we only use 79 of the websites from the top 100. To be consistent with our 
referencing, we will continue to refer to this group of websites as the Alexa Top 100. 

The top 100 Alexa Websites data contained a total of 191,754 flows. 183,892 flows had GET 
RTTs that we used to calculate RTT ratios, equaling 95.9% of what we initiated as web requests. 
Figure 4.2 depicts the 10th, 25th, 50th, 75th, 90th, and 100th percentiles as CDFs. Referring to 
Table 4.5, we can see the first value for each percentile CDF. The lowest curve represents the 
10th percentile whereas the highest curve represents the 100th percentile. The lower 90% of 
these flows had a ratio under 2.003. 


45 



1.0 


(/I 

5 


o 

o» 

<13 

c 

y 

Q) 

Q. 


0.8 


0.6 


0.4 


0.2 


°-6>. 


alexatopl00.dat 





-1 







/r / 

V / 

/ 
ijl j 

' / 







1 

1/7 







1 

/ 





- Uppe 

- 25% 

- 50% 

r 10% 







— 75% 

— 90% 

All Flows 

1 1 


25 0.50 1.00 2.00 4.00 8.00 16.00 32.00 64.00 

Ratio of GET/3WHS RTTs 


Figure 4.2: Alexa Top 100 Websites Cumulative Distribution of Ratios 


Table 4.5: Percentile Values for Alexa’s Top 100 Websites 


Average Ratio 

Maximum Ratio 

10th Percentile 

25th Percentile 

50th Percentile 

75th Percentile 

90th Percentile 

1.53292146618 

1307.11242676 

2.00307941437 

1.2427624464 

1.05298805237 

1.00675189495 

0.999756515026 


We appended the individual node results into a single data file formatted as Tables 4.3 and 4.4 
so we eould sort the data for easier observation. We refer to a single entry in this resulting file 
as a flow set. A flow set consists of all flows (up to 250 flows) collected from one PlanetLab 
vantage point and one destination website. We could see 788 distinct flow sets in the our results. 
We identify three websites that establish zero connections across all of the PlanetLab nodes. 
Each number shown in Table 4.6 shows the fraction of flow sets, out of the total 788, with at 
least a certain percentage of the flows having their GET-3WHS ratios greater than a candidate 
threshold value. 
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Generally speaking, a higher value of T corresponds to a higher confidence than the websites 
in the marked flow sets exhibit the use of reverse proxies from one or more vantage points. The 
same is true with respect to the value of X. We know that higher values of X refer to the higher 
ratios within flow sets so the percentage of flow sets out of 788 will be higher with lower values 
of X. For instance, when X is equal to 10%, we know we are looking at all flow sets’ largest 
10% of ratios. Conversely, if we look at high values of T and high values of X, we see smaller 
percentages of flow sets that meet the criteria because we are including the smaller ratios from 
all flow sets. We omit the lowest 10% of flows to counter possible server caching affects on our 
results. Though an X value of 100 that still meets a high T threshold is the strongest sign of a 
reverse proxy, the lower ratios may force a website to be discarded which is undesirable. 


Table 4.6: Percent of Top 100 Alexa Flow Sets within Ratio Thresholds 


# Flow Sets with at least 

X% flows having ratio > T 

Threshold 

T = 1.25 

Threshold 

T = 1.50 

Threshold 

T = 1.75 

Threshold 

T = 2.00 

Threshold 

T = 2.25 

X= 10 

245/788 = 31% 

158 / 788 = 20% 

124/788= 16% 

97/788 = 12% 

87/788= 11% 

X = 25 

185/788 = 23% 

126/788= 16% 

97/788 = 12% 

80 / 788 = 10% 

61 /788 = 8% 

X = 75 

144/788 = 18% 

97/788 = 12% 

73 / 788 = 9% 

54 / 788 = 7% 

41 /788 = 5% 

X = 90 

135/788 = 17% 

91 /788 = 12% 

67 / 788 = 9% 

42 / 788 = 5% 

32 / 788 = 4% 


A single table entry (Table 4.6) may be multiple counts of a single website but from a variety 
of vantage points. We can pair down the numbers in Table 4.6 by website to get the total number 
of websites given an RTT ratio threshold to be more exact. One particular website had the high¬ 
est ratios consistently across all percentiles, qq.com had average RTT ratios of 7.8 and higher 
across all PlanetLab nodes. If we factor in the prevalence of qq.com with Table 4.6, we can see 
that a single website can make up a large percentage of the table representation. Apple. com 
also had interesting results across all nodes. Upon further inspection, we used a legacy server IP 
address for Apple that redirected all traffic to a separate location. The fingerprint of this website 
and other similar websites in the Alexa Top 100 group will be compared to that of reverse proxy 
timing in the next section. Additionally, we will examine several websites such as qq. com in 
more detail in the case study examples. 
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Figure 4.3: Alexa 5,000 to 10,000 Websites Cumulative Distribution of Ratios 


4.3.2 Alexa’s Top 5,000 to 10,000 Websites 

The types of websites we see in the Alexa listing between ranks 5,000 to 10,000 inelude edu¬ 
cational institutions (i.e., Brigham Young University, Duke University, Education.com), news 
websites (i.e.. Sky News Arabia, Kemalist Gazete, Detroit News), and entertainment industry 
sites (i.e., HBO GO, Leo Vegas Online Casino, FreeRide Games). The variety of website themes 
is more diverse than the few types we noticed in the Alexa top 100. We can see a difference in 
RTT ratio results by comparing Figures 4.2 and 4.3. The larger 10% of RTT ratios for Alexa’s 
5k-10k group show a larger range of values (Figure 4.3) than the websites in Alexa’s top 100, 
which exhibit consistently low ratios except for the last 5% of total flows (Figure 4.2). More 
specifically, there are 127 flow sets in the Alexa 5k-10k group that have ratios of 2.0 or above 
whereas Alexa top 100 only has 97. 

Alexa’s top 5,000 to 10,000 websites data set had 249,918 flows. 246,917 flows had GET 
RTTs that we used to calculate RTT ratios, equaling 98.8% of what we initiated as web requests. 
Figure 4.3 depicts the 10th, 25th, 50th, 75th, 90th, and 100th percentiles as CDFs. Referring to 
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Table 4.7, we can see the first value for each percentile. The lowest curve represents the largest 
10% of RTT ratios whereas the highest curve represents the full collection of flows and RTT 
ratios. The lower 90% of these flows had a ratio under 2.001. 


Table 4.7: Percentile Values for Alexa’s 5,000 to 10,000 Websites 


Average Ratio 

Maximum Ratio 

10th Percentile 

25th Percentile 

50th Percentile 

75th Percentile 

90th Percentile 

1.607620017 

285.401397705 

2.00146484375 

1.1055123806 

1.0147652626 

1.0021417141 

0.995370566845 


Similar to our data sorting step in the previous subsection, we count 996 flow sets for our ten 
PlanetLab Nodes and the Alexa Websites between 5,000 and 10,000. Table 4.7 shows us that 
54 flow sets had a ratio higher than 2.25 when we look at the largest 90% of the 250 flows in 
that flow set. Phrased another way this means that across 90% of the flows between a PlanetLab 
node and a website the RTT ratio was higher than 2.25 for 54 total node to website pairs. 
The information provided gives us a starting point for identifying websites that may be reverse 
proxies since the majority of flows in a flow set consistently have high RTT ratios. There are 
four consistently high websites within the 54 / 996 grouping. We will discuss one of them in 
the reverse proxy fingerprinting section later in this chapter. 


Table 4.8: Percent of Alexa 5k-10k Flow Sets w/in Ratio Thresholds 


# Flow Sets with at least 

X% flows having ratio > T 

Threshold 

T = 1.25 

Threshold 

T = 1.50 

Threshold 

T = 1.75 

Threshold 

T = 2.00 

Threshold 

T = 2.25 

X= 10 

254 / 996 = 23% 

178/996= 18% 

155/996= 16% 

127/996= 13% 

114/996= 11% 

X = 25 

186/996 = 19% 

143 / 996 = 14% 

120/996= 12% 

102/996= 10% 

85 / 996 = 9% 

X = 75 

133/996 = 13% 

96/996= 10% 

83 / 996 = 8% 

67 / 996 = 7% 

59 / 996 = 6% 

X = 90 

123/996 = 12% 

90 / 996 = 9% 

75 / 996 = 8% 

58 / 996 = 6% 

54/996 =5% 


4.3.3 Alexa’s Top 500,000 to 1,000,000 Websites 

Alexa’s top 500,000 to 1,000,000 websites cover a large variety of themes. 41 of 100 web¬ 
sites in this group were international domain names (i.e., *.fr, *.ru, *.de). Furthermore, many 
of the remaining websites with a ".com" domain were listed in foreign languages (i.e., seoders- 
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lerim.com, ilheusinfo.com, eskisehir.com). When we compare Figures 4.3 and 4.4 we can see 
much less variation than when we compared Alexa’s top 100 to Alexa’s 5k-10k groups. The 
only notable difference between Alexa’s 500k to 1 million website group and Alexa’s 5k-10k 
group is that the top 10% and 25% of Alexa’s 500k to 1 million group have slightly higher ratios 
in 80% of the flows. 


alexalm.dat 



Figure 4.4: Alexa 500,000 to 1,000,000 Websites Cumulative Distribution of Ratios 


Alexa’s 500,000 to 1,000,000 websites data set contained a total of 238,165 flows. 235,545 
flows had GET RTTs that we used tocalculate RTT ratios, equaling 98.9% of what we initiated 
as web requests. Figure 4.4 depicts the 10th, 25th, 50th, 75th, 90th, and 100th percentiles as 
CDFs. Referring to Table 4.9, we can see the first value for each percentile. 


Table 4.9: Percentile Values for Alexa’s 500,000 to 1,000,000 Websites 


Average Ratio 

Maximum Ratio 

10th Percentile 

2Sth Percentile 

50th Percentile 

75th Percentile 

90th Percentile 

1.91173319044 

733.114562988 

2.29619431496 

1.17700457573 

1.02160072327 

1.00476419926 

0.999192893505 
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We counted 959 flow sets for the Alexa 500k to Im website group. Eskisehir. com did not 
allow any connections via our script, along with 3 other websites in our list. These 4 websites 
make up for 40 of the 41 lost connections out of 1000. We should have had 1000 given that 
we used 10 PlanetLab nodes to collect network traffic on 100 websites. Accuenergy.com, 
oreno .co.jp. Aerial?. com, Likvidaci ja-ooo. ru, and GetRentalCar. com dominated the 
largest ratios in this website group across all 10 nodes, accounting for 40 of the 68 flow sets 
using an RTT ratio threshold of 2.25. We do a case study on GetRentalCar.com in a later 
section because it had 9 out of 10 flow sets in this highest ratio threshold. 


Table 4.10: Percent of Alexa 500k-lm Flow Sets w/in Ratio Thresholds 


# Flow Sets with at least 

X% flows having ratio > T 

Threshold 

T = 1.25 

Threshold 

T = 1.50 

Threshold 

T = 1.75 

Threshold 

T = 2.00 

Threshold 

T = 2.25 

X= 10 

263 / 959 = 27% 

199/959 = 21% 

166/959 = 17% 

139/959= 14% 

117/959= 12% 

X = 25 

214/959 = 22% 

168/959 = 17% 

142/959 = 15% 

111 /959= 12% 

95 / 959 = 10% 

X = 75 

186/959 = 19% 

146/959 = 15% 

115/959 = 12% 

93 / 959 = 10% 

77 / 959 = 8% 

X 

II 

o 

170/959 = 18% 

141 /959 = 15% 

107/959 = 11% 

85 / 959 = 9% 

68 / 959 = 7% 


4.3.4 Malicious Websites 

The malicious websites that we targeted were based off of a continuously updated web¬ 
site, Malware Domain List [52]. This website provides malicious website entries with date 
of reported activity, domain name, IP address, description of malicious action, registrant, au¬ 
tonomous system number, and country of origin. The types of malicious activity include a 
variety of purposes; ransomware, trojans, exploit kits, malicious redirection to compromised 
websites, java exploits, viruses, and other similar functions. There are over 1000 websites that 
are tracked as active. We collected network traffic on 300-400 of the websites. The range is 
variable because of slower connections dependent on the PlanetLab vantage point. We are still 
collecting data on the remaining malicious websites in the list for future work. 

The malicious website data set contained a total of 463,664 flows. 404,315 flows had GET 
RTTs that we used to calculate RTT ratios, equaling 87.2% of what we initiated as web requests. 
The large percentage of failed requests may be attributed to websites being identified and taken 


51 





1.0 


malSites.dat 


1/1 

s 


(U 

Ol 

(□ 

4-< 

c 

0 ) 

a 

0/ 

Q. 




/y 








/// 

\ / 








7 

\ / 









i 




- Uppe 

- 25% 

- 50% 

r 10% 




1 



— 75 % 

— 90% 

All Flows 

1 1 


0.25 0.50 1.00 2.00 4.00 8.00 16.00 32.00 64.00 

Ratio of GET/3WHS RTTs 


Figure 4.5: Malicious Websites Cumulative Distribution of Ratios 


down my domain administrators, or relocated to avoid interruptions to malicious operations. 
Table 4.11 shows a much higher average ratio compared to the other website groups. Figure 4.5 
reflects that roughly 60% of all flows have an RTT ratio near 1. The last 20% of all flows, 
or those flows with the highest RTT ratios, are the set of websites we will analyze in the case 
studies in the later section of this chapter. 


Table 4.11: Percentile Values for the Malicious Websites 


Average Ratio 

Maximum Ratio 

10th Percentile 

25th Percentile 

50th Percentile 

75th Percentile 

90th Percentile 

2.08384281878 

1130.85083008 

3.09227824211 

1.27307891846 

1.02019119263 

1.0042399168 

0.999864161015 


52 




























We counted a total of 1851 flow sets for our malicious website group. The results in Ta¬ 
ble 4.12 show higher percentages of ratios meeting thresholds when compared to all other web¬ 
site groups (Figures 4.6, 4.8, and 4.10). More specifically, the largest 90% of flow sets using 
a threshold of 2.25 is 5% higher than the next highest website group (Alexa 500k to Im). The 
218 flow sets within the aforementioned parameters have a few interesting characteristics. 

The 149.47.0.0/16 subnet make up 78 of the 218 flow sets. Whols lookups for the 
149.47.0.0/16 network point to a web hosting company named Precipice [53]. The mali¬ 
cious website locations are geolocated to various positions around the United States (i.e.. New 
York City, NY, San Francisco, CA, Carson City, NV). The Precipice website is a plain website 
with little functionality. There are several pages that speak about the services and company but 
there aren’t any secure methods to sign up for service or contact the company. The contact page 
has a simple form that can be filled out and submitted to the company, though the page has no 
encryption. Overall, the Precipice company looks like a poorly managed web-hosting service 
that can be used by malicious actors without much scrutiny. Further research on the company 
background yields phishing reports under the Precipice AS [54] (AS36444). 


Table 4.12: Percent of Malicious Site Flow Sets w/in Ratio Thresholds 


# Flow Sets with at least 

X% flows having ratio > T 

Threshold 

T = 1.25 

Threshold 

T = 1.50 

Threshold 

T = 1.75 

Threshold 

T = 2.00 

Threshold 

T = 2.25 

X= 10 

531 / 1851 =29% 

427/ 1851 =23% 

359/ 1851 = 19% 

314/1851 = 17% 

291 / 1851 = 16% 

X = 25 

452/ 1851 =24% 

373/ 1851 = 20% 

310/1851 = 17% 

283 / 1851 = 15% 

267/ 1851 = 14% 

X = 75 

414/ 1851 =22% 

336/ 1851 = 18% 

276/ 1851 = 15% 

263/ 1851 = 14% 

237/ 1851 = 13% 

X = 90 

401 / 1851 =22% 

318/1851 = 17% 

269/ 1851 = 15% 

247/ 1851 = 13% 

218/1851 = 12% 


The majority of IP addresses within the highest threshold group geolocate to servers in the 
United States. Other countries within the same group include the United Kingdom, Sweden, 
and Russia. We will do further analysis on three of the websites within the highest threshold 
that show consistently high ratios across all percentiles later in this chapter. 

4.4 Reverse Proxy Identification 

We controlled a single reverse proxy server that was built on Apache [19] and located in 
Monterey, CA. We set up the internally connected website to be across the country to increase 


53 





delays. The inereased delay would give us a strong signature for our timing analysis to help with 
assessing an appropriate threshold. Reverse proxy eonneetions are unlikely to have sueh large 
distanees to data centers, but matching websites give us higher confidence in the fingerprint. 
To increase our data set for reverse proxies, we used a website that advertised the free use of 
"web proxies." These web proxies were provided to execute anonymous downloads through 
TorrentProxies [55] and its affiliates. Our intuition was that these websites were not reverse 
proxies, but forwarding proxies to allow anonymized web traffic to subvert illegal activities. 
Instead of discarding the websites, we added ten of the advertised proxies to our reverse proxy 
website list and ran our traffic captures. The following section will highlight the results of the 
reverse proxy tests and present our threshold analysis and comparison to motivate our choices 
for case studies. 

4.4.1 Reverse Proxy Website Group 
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Figure 4.6: Reverse Proxy Websites Cumulative Distribution of Ratios 

Once we conducted the initial tests against our reverse proxy lists we observed results unlike 
we expected. Referring to Figure 4.6, we can see that roughly 80% of all the flows had an 
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RTT ratio of 1. This would imply that our reverse proxies served as direet eonneetions. This 
is possible, but we believed it to be unlikely given our base ease experiments. Upon further 
analysis of the aetual network traffie for the websites from TorrentProxies [55], we eould see 
that the traffie was being handled as a direet eonneetion. We also tested several of the websites 
in a eommon web browser to see what website was rendered. The returned web page was 
an error from the proxy serviee provider stating that direet eonneetions to the server were not 
allowed. 

If we use one of the listed proxy addresses in our Internet options, we ean anonymously 
browse the Internet or download illegal eontent. The latter is the primary purpose for Torrent- 
Proxies [55]. These proxies were aetually forwarding proxies instead of reverse proxies. We 
had to limit our reverse proxy threshold analysis to the 10 flow sets against our own reverse 
proxy server. 


actualRP.dat 



Figure 4.7: Reverse Proxy Websites Cumulative Distribution of Ratios - True Reverse Proxy 


Figure 4.7 refleets the RTT ratios for an aetual reverse proxy server. We ean easily see that 
the bulk of RTT ratios are roughly 2. Table 4.13 shows the authentie reverse proxy pereentile 
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results. Though our set is limited, we have 2,490 flows against a known reverse proxy server 
with 100% success. We could set a threshold of 1.6 if we wanted to encompass all possible 
reverse proxy servers, or set a threshold T between 2 and 3 to discard anomalous RTT ratios. 
Our value of X needs to be 90% so we encompass the majority of flows while omitting the lowest 
ratios that may have been affected by caching. On the other hand, abnormally large ratios could 
be the result of bursty network delays that affect data packet transfers, legacy servers that have 
limited processing power for data packet transmission when under heavy loads, or possibly 
packet loss or retransmissions. 


Table 4.13: Percentile Values for the Known Reverse Proxy Server 


Average Ratio 

Maximum Ratio 

10th Pereentile 

25th Percentile 

SOth Percentile 

75th Percentile 

90th Percentile 

2.42821306276 

21.9782352448 

3.41394352913 

2.42637991905 

2.19787526131 

1.69977366924 

1.60702753067 


The comparison of threshold results between Table 4.14 and the other website groups is 
plain. We have 10 flow sets for the reverse proxy. We will need to create more reverse proxy 
servers of our own or identify known reverse proxy servers for future work to build upon our 
threshold identification. We want to use a threshold that is as close to 100% identification 
against known reverse proxies while not overlapping with known direct connections to maxi¬ 
mize true positives and negatives while minimizing false positives and negatives. 


Table 4.14: Percent of Reverse Proxy Flow Sets w/in Ratio Thresholds 


# Flow Sets with at least 

X% flows having ratio > T 

Threshold 

T = 1.25 

Threshold 

T = 1.50 

Threshold 

T = 1.75 

Threshold 

T = 2.00 

Threshold 

T = 2.25 

X= 10 

10/10= 100% 

10 /10 = 100% 

10/10= 100% 

6/10 = 60% 

6 /10 = 60% 

X = 25 

10/10= 100% 

10/10= 100% 

7 / 10 = 70% 

6 /10 = 60% 

5 /10 = 50% 

X = 75 

10/10= 100% 

10/10= 100% 

6/10 = 60% 

6/10 = 60% 

4 / 10 = 40% 

X = 90 

10/10= 100% 

9 /10 = 90% 

6/10 = 60% 

5 /10 = 50% 

2 / 10 = 20% 


Instead of identifying a statistical tool for identifying the best threshold value, we will at¬ 
tempt to maximize our truth values by using a high threshold. Additionally, we will match 
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website flow sets across all 10 PlanetLab nodes to identify websites that consistently have RTT 
ratios above the threshold we set. This will help omit vantage point influence with our finger¬ 
printing. We provide a table in the next subsection to help motivate our choices for the case 
studies by using the two aforementioned discriminators. 

4.4.2 Websites that match Reverse Proxy Threshold 

In Table 4.15 we identify those websites within each website group that exhibit high RTT 
ratios across at least 6 of the 10 flow sets for that website. We look at the upper 90% of flows 
for each flow set. The domain name and IP address are both provided if available. The optimal 
fingerprint would have a score of 10/10 with a high threshold. The fingerprint is weaker as the 
number of 10/10 threshold values in Table 4.15 decreases. We identify the lowest threshold that 
has 10/10 for a website as a baseline fingerprint because that represents the best performing 
flow set for the website (i.e., highlighted cells in Table 4.15). 

The motivation for desiring 10/10 within a high threshold is based on the expectation that 
consistently high ratios are a stronger indicator of a reverse proxy. If a particular website shows 
even a single flow set within a low threshold, we have proof that the target website is capable of 
servicing fast RTTs for data. The remaining flow sets at higher thresholds then have less value 
because they could be based on various network delays resulting from vantage point locations. 

We want to choose a single website within each set that exhibits behaviors similar to reverse 
proxies that we can examine further. Apple . com did not show strong threshold results in Ta¬ 
ble 4.15 but we were given information about the server IP address that resulted in interesting 
values. In the Alexa 5k to 10k group we choose f eedblitz. com because it has a strong thresh¬ 
old value across all PlanetLab Nodes. Oreno .co.jp had the highest threshold match within the 
Alexa 1 million website group. Lastly, we examine the top three malicious website threshold 
matches to investigate the possibility of reverse proxies in known malicious domains. We also 
compared the case study websites against each other to highlight similarities. 
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Table 4.15: Highest Ratio Websites per Website Group and their Corre¬ 
sponding Fraction of Flow Sets per Ratio Threshold 


Website 

Group 

Domain name / 

IP Address 

Threshold 

T = 1.25 

Threshold 

T = 1.50 

Threshold 

T = 1.75 

Threshold 

T = 2.00 

Threshold 

T = 2.25 

Alexa Top 100 

apple.com 

17.149.160.49 

9/10 

6/10 

3/10 

2/10 

2/10 

Alexa Top 100 

amazon.com 

205.251.242.54 

9/10 

7/10 

5/10 

3/10 

3/10 

Alexa Top 100 

flickr.com 

68.142.214.24 

|10/ 11^ 

9/10 

9/10 

7/10 

6/10 

Alexa Top 100 

qq.com 

125.39.240.113 

10/10 


6/10 

1 /10 

1 /10 

Alexa 5k - 10k 

feedblitz.com 

74.208.13.17 

10/10 

10/10 

10/10 

10/10 

:I07T^ 

Alexa 5k - 10k 

ujian.cc 

210.14.128.141 

10/10 

10/10 

iTOTTQ 

3/10 

2/10 

Alexa 5k - 10k 

skynewsarabia.com 

94.200.220.238 

10/10 

10/10 

10/10 

10/10 

10713^ 

Alexa 5k - 10k 

bloggertipstricks.com 

162.144.5.3 

10/10 

|T^ 

8/10 

6/10 

6/10 

Alexa 500k - Im 

getrentalcar.com 

217.23.2.151 

10/10 

10/10 

10/10 

TOTTfl 

9/10 

Alexa 500k - Im 

oreno.co.jp 

158.199.163.43 

10/10 

10/10 

10/10 

10/10 

10“713^ 

Alexa 500k - Im 

accuenergy.com 

204.15.197.140 

10/10 

10/10 

10/10 

,Tcrng 

9/10 

Alexa 500k - Im 

likvidacija-ooo.ru 

50.87.150.6 

10/10 

ilOTTi 

9/10 

7/10 

6/10 

Malicious 

(none) 

129.121.160.162 

10/10 

10/10 

10/10 

10/10 

107131 

Malicious 

(none) 

149.47.152.172 

10/10 

10/10 

10/10 

10/10 

|10/I^ 

Malicious 

(none) 

149.47.155.172 

10/10 

10/10 

10/10 

10/10 

TOTTII 

Malicious 

(none) 

149.47.132.173 

10/10 

10/10 

10/10 

Tom 

9/10 
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4.5 Case Studies 

The case studies in this section are a deep dive of the target website results across all 10 
PlanetLab nodes. The RTT ratio CDFs, the percentiles, and examination of the actual pcap files 
are presented to help reinforce our hypothesis and answer some of our research questions from 
Chapter 1. Each website was chosen because its results were distinct amongst its website group. 

In the case of the malicious websites, we present three separate case studies since our data set 
was larger for malicious domains, and these targeted websites are a focus for this research. 

4.5.1 Alexa Top 100 Case Study - Apple.com 

Apple’s website results were chosen for the Alexa top 100 case study because we had ac¬ 
cess to a reliable source of information regarding how Yahoo is hosted. We used the URL "ap¬ 
ple.com" to do a domain name service lookup that resolved the IP address we used (17.149.160.49) 
to collect our data. Our reliable source of information was that the URL http: //apple . com 
was a legacy server, and that the server hosted no content. The server merely redirected traffic 
to the appropriate Apple Webserver based on the type of traffic (i.e., http: //www. apple. com 
for HTTP on port 80). 

We performed a series of manual inspections on the network traffic generated between the 
client and the legacy server and the client to http://apple.com. We did a DNS lookup to 
get the IP addresses that are listed for "apple.com" are 17.172.224.47, 17.149.160.49 (the IP 
address we use to collect network traffic), and 17.178.96.59. The DNS lookup results against 
WWW. apple. com are different. If a client attempts to connect to http: //apple . com, the net¬ 
work traffic will show an "HTTP/1.1 301 Moved Permanently" response, redirecting the client 
to an IP address for one of Apple’s web servers. The HTTP 301 response is different from 
the previously explained HTTP 302 response in that it tells the client to not revisit the URL/IP 
address in future requests, but it is up to the client whether or not it follows this response. These 
three legacy servers serve the same web content as http: //www. apple . com if the web client 
connects to one of the three legacy server IP addresses in the web browser or uses a command 
line tool such as wget. The web content is possibly cached at each of these legacy servers for 
web clients that force HTTP requests to one of the three IP addresses, or the legacy servers are 
configured as reverse proxy servers. 
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Figure 4.8: Apple.com Cumulative Distribution of Ratios 


Table 4.16: Percentile Values for Apple.com 


Average Ratio 

Maximum Ratio 

10th Pereentile 

25th Pereentile 

50th Percentile 

75th Percentile 

90th Percentile 

4.20791528898 

445.687316895 

9.60596847534 

2.8955848217 

1.72469222546 

1.38438487053 

1.28809475899 


Referring to Table 4.16 and Figure 4.8, we ean see that roughly 40% of all flows have a flow 
set ratio of 1.75 or less. These flow sets are a eombination of slower 3WHS RTTs and eompar- 
atively fast GET RTTs that result in a lower RTT ratio. These RTT ratios can be misleading in 
our findings because, though we may have a reverse proxy, spikes in network delay during the 
3WHS can cause our ratio results to look like direct connections. When we look at larger per¬ 
centages of total flows, we can see that at least 80% of all flows have RTT ratios of above 2.0. 
The legacy servers are slower to respond with data yet still send an HTML 200 OK response, 
whereas Apple’s primary web servers have quick response times with ratios near 1.0 with the 
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same HTML 200 OK response. The slower response times with matching data retrieval suggests 
that the URL http://17.149.160.49 is serving as a reverse proxy when a client directly 
connects to it. 

It is possible that these legacy servers serve as reverse proxies to the actual web servers or 
other Apple data servers when web clients connect directly to the legacy server IP addresses. 
Another possibility is that the servers are not built to handle network load for serving content 
to clients but more suited for redirection as we can see with the DNS resolution tests so the 
processing delay for data service leads to high RTT ratios, but we believe that the processing 
delay is a less likely scenario. We assess that processing delay would vary over time based on 
network load so that the range of ratios would be large, but our ratios are mostly above 1.75 
across all vantage points. Our conclusion on this particular case study is that the legacy servers 
provide redirection when web clients mistakenly type the URL http: //apple. com into their 
web browser, but serve web content as reverse proxies when web clients use their network IP 
addresses directly. 

4.5.2 Alexa 5k to 10k - feedblitz.com 

We wanted to verify that the URL http: //f eedblitz. com resolved to the same IP address 
as the URL http: //www. f eedblitz. com because of the findings in the Apple case study. The 
list of IP addresses returned for both NS lookups contained the IP address we used to generate 
network traffic (74.208.13.17). Feedblitz advertises itself as an automated marketing service for 
email and social media [56]. 


Table 4.17: Percentile Values for feedblitz.com 


Average Ratio 

Maximum Ratio 

10th Pereentile 

25th Percentile 

SOth Percentile 

75th Percentile 

90th Percentile 

21.5330613914 

93.4940490723 

36.8966827393 

29.5751972198 

22.8499794006 

9.2367067337 

7.84833526611 


The results referenced in Figure 4.9 and Table 4.17 are interesting because the threshold 
values across all nodes remain high. Furthermore, all flows exhibit RTT ratios of at least 4.0. 
When we look at the network traffic in a single HTTP request to Feedblitz, it is easy to notice 
the large time disparity in the 3WHS and GET RTTs. The HTML response for each web request 
states that the server is running Microsoft IIS/7.0, and states several extra fields reflecting the 
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feedblitz.dat 



Figure 4.9: feedblitz.com Cumulative Distribution of Ratios 


use of ASP.NET and XML-RPC, a remote proeedure eall using XML over HTTP. Blogging 
websites use XML RPC and PHP over HTTP to notify externally linked websites that they 
were refereneed within a served page’s eontent. In eaeh of the Alexa website groups, blogging 
sites regularly had high RTT ratios. It is possible that XML RPC servieed webpages may be set 
up as reverse proxies. 

When we use the Whols serviee to look up the feedblitz IP address, we get information 
about l&I Internet Ineorporated, a web hosting serviee out of Chesterbrook, PA who owns 
that IP address. They offer a seleetion of servers for hosting websites, whieh is likely where 
our interesting results originate, l&l Internet Ine. offer three types of servers; virtual servers, 
dynamie eloud servers, and dedieated servers [57]. The delays we observe in getting the data 
is likely beeause of an Internet-faeing web server used by l&l Internet Ine. to proxy their web 
serviees. 

The eonneetion to I&I’s Internet-faeing server is quiek, but when we request feedblitz. com, 
the server is required to find the appropriate data on whiehever server feedblitz. com is utiliz- 
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ing, obtain the data, and forward the data from one of their internal servers on to the requesting 
client. We assess it is unlikely that f eedblitz. com is using a dedicated server because the re¬ 
sponse times would be significantly faster, f eedblitz. com is likely using a virtual or dynamic 
cloud server which is why we see such a large delay in data response. The data timing delays are 
consistent with reverse proxies, and we therefore assess that feedblitz. com is being hosted 
behind a reverse proxy at l&l Internet Incorporated. 

4.5.3 Alexa 500k to 1 Million - oreno.co.jp 

Oreno.co.jp is a Japanese restaurant website. It uses XML RPC and PHP web pages 
similar to Feedblitz in the Alexa 5k to 10k case study. An interesting phenomenon with this 
website occurs when requesting the website via the domain name in a web browser or HTTP 
request tool (i.e., wget). The web traffic to http://oreno.co.jp shows the IP address as 
158.199.163.43, but returns an “HTTP/1.1 301 Moved Permanently” response. The response 
redirects the client to the exact same IP address and TCP port with a separate domain name 
(http: //www. oreno . co . jp). If we attempt to connect directly to the IP address, we do not 
receive the HTTP 301 response. We collected all of our network traffic by connecting directly 
to the IP address so the HTTP 301 response did not impact our results. 

The strange behavior with the domain name responses may be indicative of a reverse proxy 
server because the domain name service would resolve both domain names (with or without 
the "www") to the same IP address. A reverse proxy may be configured to only retrieve web 
content for a specific HTTP header that includes the "www" in the request. When we use the 
IP address as a hard-coded destination, the domain name service is bypassed so we do not run 
into the same problem. 

There were similar RTT ratios for this website compared to Feedblitz. We had an extreme 
case with our Singapore PlanetLab node that had RTT ratios greater than 198 for 90% of all 
flows. If we omit this particular node, we still see consistently high RTT ratios across the re¬ 
maining nodes that are at least 2.77 for 90% of all flows. The website uses javascript, cascading 
style sheets, XML, and PHP. The Whols results for http: //oreno .co.jp fall into the own¬ 
ership of KDDI Web Communications Incorporated, a web-hosting service out of Japan [58]. 
We expect that the configuration of KDDI’s Internet-facing IP address is similar to that of l&l 
Internet Inc. for feedblitz. com. Though KDDI does not advertise server types on their web¬ 
site, they do mention the high need for Internet domains for Japanese businesses but with a low 
supply. They offer their hosting services as a solution. If they are hosting over 40,000 corporate 
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Figure 4.10: Oreno.co.jp Cumulative Distribution of Ratios 

clients [58], it is likely they have a large set of data eenters to store all of the content. Clients 
will easily eonneet to the Internet faeing web server, but then have to wait for the data to get 
transferred from an internal data server baek out to the elient. The delay is that of a reverse 
proxy. 


Table 4.18: Percentile Values for Oreno.co.jp 


Average Ratio 

Maximum Ratio 

10th Percentile 

25th Percentile 

50th Percentile 

75th Percentile 

90th Percentile 

27.4278299519 

283.344116211 

117.103462219 

4.58236169815 

3.92079305649 

3.65683436394 

2.99382328987 


We believe http: //oreno .co.jp and KDDI Web Communieations showed definitive signs 
of reverse proxy eharaeteristies. The eonsistently high RTT ratios are the strongest indieator. 
The larger then Internet beeomes, the more web hosting services must utilize the larger range of 
domain names to make up for the redueed IP address spaee. DNS allows web hosting serviees 
like KDDI to utilize internal data storage and reverse proxy servers to handle web requests. 
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4.5.4 Malicious Website One - 129.121.160.162 

Malicious website one (129.121.160.162) is more speeifieally refereneed at MalwareDo- 
mainList.eom [52] as a Blaekhole exploit kit. The website is hosted in the United States by 
Oso Grande Teehnologies in Albuquerque, New Mexieo. The blaekhole exploit uses obfus- 
eated javaseript within a webpage to identify features of the elient’s maehine then eompromises 
the maehine based on the results. The webpages are most likely used for short periods of time 
under the expeetation that the webpages will eventually be shut down onee they are identified as 
malieious. Blaekhole exploit kits have Russian origins. The kits made up 41% of web attaeks 
in 2012 as reported by the Symantee Corporation [59]. 


1.0 


0.8 


lo.e 


0) 

to 

§ 0.4 
<u 

CL 


0.2 


129.121Raw.dat 


^■ 8 . 


1 1 ~ 

— Upper 10% 

— 25% 

— 50% 







- < 

i 

75% 

90% 

'ill Flow; 







( 

i 







I _ 

A 






/ 

1 

// 

^ 1 

f 







J 

1 

L 


1 1 


.25 0.50 1.00 2.00 4.00 8.00 

Ratio of GET/3WHS RTFs 


16.00 32.00 


64.00 


Figure 4.11: 129.121.160.162 Cumulative Distribution of Ratios 

Refereneing Figure 4.11 and Table 4.19, we ean see that the RTT ratios are eonsistently 
high aeross all PlanetLab nodes. When we visit the website, we get a webpage that looks 
like a simple blog foeused on data merging and smoking eigarettes. Shortly after the page 
loads, an advertisement pops up for eigarette sales. The link for the advertisement goes to 
theecig.net/v2savemore. MalwareDomainList.com [52] provides the IP address as a mali¬ 
eious website, but the URL for the malieious exploit is at 
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http://129.121.160.162/f2bca24cbac8843a780d35653bfe8548/q.php. The aforemen¬ 
tioned page no longer exists, so we are not able to run the exploit in a sandboxed environment. 
Based on the timing results, we assess that this website is behind a reverse proxy. 


Table 4.19: Percentile Values for 129.121.160.162 


Average Ratio 

Maximum Ratio 

10th Pereentile 

25th Percentile 

SOth Percentile 

75th Percentile 

90th Percentile 

39.1598096188 

670.701721191 

69.3232040405 

57.3231506348 

25.9222316742 

13.6023540497 

8.92593288422 


4.5.5 Malicious Website Two -149.47.152.172 

Malicious website two (149.47.152.172) is also referenced as a Blackhole exploit kit, but it 
is under version 2.0 [52]. The website is hosted by Precipice Webhosting services out of New 
York City, New York. We can see in Table 4.20 that the RTT ratios for this website are much 
lower than that of malicious website one (Table 4.19). The RTT ratios are consistently above 
3.0 across all PlanetLab nodes, with the lowest around 2.2 for 90% of all flows across all nodes. 

We were unable to make connections to either the IP address directly or the URL ref¬ 
erenced as malicious at MalwareDomainList.com [52]. The malicious URL string is listed 
as http://149.47.152.172/Web/links/replacement-based_destroy-varies.php. We 
could not conduct any further analysis of the actual network traffic to the site beyond what our 
initial traffic depicted. The RTT ratios suggeste that this is also a reverse proxy as the RTT 
ratios were consistently high across all PlanetLab nodes. 


Table 4.20: Percentile Values for 149.47.152.172 


Average Ratio 

Maximum Ratio 

10th Percentile 

25th Percentile 

50th Percentile 

75th Percentile 

90th Percentile 

8.38564520167 

30.7004337311 

16.1830101013 

12.2755508423 

6.76642513275 

3.60729312897 

3.35419392586 


4.5.6 Malicious Website Three - 149.47.155.172 

Malicious website three (149.47.155.172) exhibits the same behaviors as malicious website 
two. Furthermore, it is reported to be using the same exploit (Blackhole Tool Kit 2.0) [52]. The 
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Figure 4.12: 149.47.152.172 Cumulative Distribution of Ratios 


HTML returned from the network traffic includes the URL http: //vitashoesco. com/. We 
attempted to connect to this website, the IP address we got from Malware Domain List, and 
the URL reported as an exploit. None of the URLs were still active. In the same fashion as 
the previous two malicious sites that showed similar behavior, we reference Figure 4.13 and 
Table 4.21 as evidence that the server was using a reverse proxy to hide the backend malicious 
behaviors. 


Table 4.21: Percentile Values for 149.47.155.172 


Average Ratio 

Maximum Ratio 

10th Pereentile 

2Sth Percentile 

SOth Percentile 

75th Percentile 

90th Percentile 

8.79291193862 

111.13899231 

17.8608493805 

12.6154546738 

7.89357471466 

3.61290693283 

2.99382328987 
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Figure 4.13: 149.47.155.172 Cumulative Distribution of Ratios 
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CHAPTER 5: 

CONCLUSIONS AND FUTURE WORK 


We present our conclusions in this chapter based on the results and analysis against Alexa 
website groups and malicious websites. We identify the strengths and research contributions of 
our experiments as well as the shortfalls. These shortfalls serve as motivation to improve our 
established methodology in future work. 

5.1 Conclusions 

We developed a methodology to actively probe targeted websites across a set of vantage 
points, gather the network traffic from the HTTP requests, and analyze the RTT results from 
HTTP sessions to identify reverse proxies. The vantage points were globally distributed to 
check RTT consistency and increase the set of network traffic we could analyze. We looked 
at three sets of selected websites using Alexa’s website ranking system [50]. The three sets 
covered the most popular sites, mid-grade popularity sites such as educational institutions and 
entertainment providers, and lower popularity sites such as foreign websites, small business 
e-commerce, and web blogs. We also gathered a large set of flows against known malicious 
websites using a database that was current [52]. 

Our analysis was based on the ratio of the GET RTT over the 3WHS RTT. This ratio gave 
us a simple value that we could compare against our known set of reverse proxy servers. We 
established a range of threshold values based on the known reverse proxy ratios that we then 
compared across all website groups. We discovered that large RTT ratios existed in all types 
of website groups. Approximately 15-20% of all website groups that we analyzed showed 
reverse proxy-like timing ratios. The ratio marker indicates that data-centric delays in HTTP 
connections apply to high performance websites, blogs, and even malicious websites. 

We claim that our ratio marker could be caused by several factors. Processing delays in 
the lower performance websites could cause slow response times for data when under heavy 
traffic loads. Medium-level popularity websites are likely hosted behind CDNs, web-hosting 
services, or possibly reverse proxies of their own. The most popular websites likely use CDN 
technology and their own web-hosting servers that may include reverse proxies. Malicious 
websites are likely using compromised web-hosting services or home workstations with built- 
in reverse proxies to hide their activities, and the larger RTT ratios could be a combination of 
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an aforementioned factor with processing delays because of network edge web hosting. 


The Alexa Top 100 website group showed high performance results given that the majority 
of RTT ratios were under the lowest ratio threshold we established (1.25). The Alexa 5k to 10k 
website group had a larger disparity between the low ratios and the upper ratios (See Fig. 4.3) 
when compared to Alexa’s top 100 website group. Similarly, Alexa’s 500k to 1 million website 
group had roughly 75% of all flows with a ratio of 1.25 or less. The malicious websites were 
the closest in comparison to Alexa’s 500k to 1 million group, exhibiting consistently low ratios 
except that the difference between the averages and the upper 10% of ratios was large, hence 
the long-tailed CDF. 

Without having a way to verify our results, we cannot say with 100% certainty that our 
assessed reverse proxies are, in fact, reverse proxies in reality. However, the test cases we 
performed using manual packet inspection all suggest our methodology is highly accurate. We 
did see some interesting results in our case studies. Apple.com and oreno.co.jp provide 
redirection when “www” is not included in the URL. Apple redirected to a different server, 
though if a client connected to the IP address of apple. com the client would still get the same 
website as www. apple . com. Oreno .co.jp would redirect the client to the same IP address and 
the same port, which seems like a misconfigured Webserver but could be indicative of a reverse 
proxy. 


5.1.1 Strengths and Contributions 

The greatest contribution of our research is the application layer fingerprinting methodology. 
We discovered that HTTP could be used to reveal indistinct characteristics of servers using the 
timing of each packet in an HTTP request. The time required for a TCP session to be established 
is roughly equivalent for most HTTP requests and their corresponding responses. We showed 
that a ratio of the GET RTT over the 3WHS RTT from a TCP session could be used with 
known timing thresholds of a reverse proxy to identify the server type. The methodology we 
created can be used from any client to any server that supports HTTP traffic. The modularity 
and simplicity of our approach can be correlated with other known fingerprinting techniques to 
have higher levels of confidence in networking signatures as well. 

Secondly, we performed a number of small experiments throughout this research to carefully 
validate our methodology and expectations. We built a series of small scale tests using servers 
we controlled in order to manually inspect packet dynamics from the client and the server van- 
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tage points. Developing timing diagrams based on the resulting network traffic helped us realize 
that Karn’s algorithm would not be sufficient, which forced us to discard popular traffic analysis 
tools widely used on the Internet. Furthermore, the small experiments and class projects gave 
us helpful experiences that led us to approach data collection and organization in more efficient 
ways. 

The frequent use of scripting helped us automate mundane tasks such as data organization, 
data collection, graphing, and table building. We were able to quickly create scripts using bash 
and/or python to meet most of our needs. We had a few scripts that were more robust, such as the 
data collection and graphing scripts. We were able to customize these scripts using parameters 
or some manual changes to the syntax to automate valuable steps in our research. Python’s 
matplotlib and numpy modules were significant in helping us graph CDFs across large data sets 
in our python scripts, and saved us a lot of manual effort. 

The number of HTTP requests submitted per website across each node gave us sufficiently 
large data sets. We believe our data sets were big enough to outweigh most anomalous behavior 
in network delays that could have swayed our results. The number of requests could be in¬ 
creased, but the results would likely remain consistent. In fact, we could have likely performed 
100 or fewer HTTP requests instead of 250 and still gained the same insights that we did in our 
research. 


5.1.2 Shortfalls 

The number of known proxy servers is the most glaring shortfall in our research. We ex¬ 
ecuted HTTP requests across hundreds of popular and malicious websites but had only one 
known reverse proxy server by which to compare. We were able to make assessments on fin¬ 
gerprinted websites but the training set must be larger in future work. This is imperative so that 
an accurate threshold can be set. 

Domains resolve to different IP addresses depending on the vantage point, which can cause 
inconsistencies in RTT comparisons across global vantage points. To work around this issue, 
we chose a single vantage point (Monterey, CA) to pull IP addresses. These IP addresses were 
used to execute the HTTP requests instead of domain names. DNS lookups can return CDN 
servers, legacy servers, false destinations, or fail altogether depending on the input. When we 
searched for the IP address of Apple.com, we searched apple. com instead of www. apple. com. 
The former provides a group of IP addresses to legacy servers. The latter returns a group of IP 
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addresses to current websites for the Apple corporation. 


The IP address lists in future work should be inspected thoroughly and then put into sub¬ 
groups. An example could be the websites in the Alexa top 100. If a website IP address points 
to a CDN server such as Akamai, it is put into a "known CDN" subgroup. Alternatively, the 
website IP address could be put into "known direct connection" or "unknown" as well. This 
would help with data organization and analysis. 


5.2 Future Work 

Several research questions we wished to answer circled around building a machine learning 
classification mechanism, which we did not accomplish in our work. We are currently working 
on a classifiication solution for future projects. We were unable to answer the following research 
questions in this thesis: 

1. Given a set of non-proxy webservers, their IP addresses, a set of known reverse proxy 
webservers and their IP addresses, can an automated Internet request program be trained 
to classify unknown servers as non-proxy webservers or reverse proxy webservers? If so, 
what is the confidence interval for classifying unknown servers with the trained program? 

2. What is the false positive rate and false negative rate for the classifier results? The F- 
Score? 

3. Is the timing data equivalent regardless of the HTTP request format (GET, POST, etc.)? 

4. How does caching affect reverse proxy RTTs and how can we detect caching? 

5. What is the time granularity required to differentiate 3WHS RTTs from GET RTTs? Are 
the current clock resolutions sufficient? 

6. Can we differentiate between reverse proxy servers and CDNs with a classifier? What 
level of precision do we need in our threshold to reduce false positives and false negatives 
concurrently? 

We recommend testing this approach against web servers that have system administrators 
who can be contacted to verify the presence of a reverse proxy. Our method was not verified 
for any of our research aside of one website which makes proving our methodology’s accuracy 
and precision difficult to quantify. Additionally, more global vantage points should be used to 
check domain names to IP addresses, increase data collection for more comparative analysis, 
and widen the scope of possible RTT ratios. The vantage points are affected by some delays 
more than others based on distance to the Webserver, network path, network infrastructure, and 
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so forth. As the number of vantage points increases, so does the confidence in fingerprinting 
websites because the RTT ratios can be calculated using a diverse set of delays. We used mostly 
IP addresses that were located in the United States. We want to use a much larger variety of 
vantage points in future testing, specifically spreading our selection of vantage points around 
the globe to maximize the diversity of network delays that affect timing. Additionally, the 
results may be more varied if we performend DNS lookups from a Russian or Chinese vantage 
point. Though our data set was large enough for thesis research, a larger set of flows with more 
delineated website groups could be easier to analyze as well. 

A classifier that can perform machine learning techniques against websites is the end goal. 
We would like to refine our code base to train against known reverse proxies and direct con¬ 
nections to build statistically accurate thresholds then classify websites. A third server group 
could be CDNs, whose ratios may fall in between the timing thresholds of reverse proxies and 
direct connections. We could then identify the true statistical value of our approach, identify 
weaknesses, and fix those weaknesses or fill the gap with another fingerprinting technique to 
strengthen the overall classifier. 
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